Bhupen committed · Commit ce707b9 · 1 parent: 53b25e6

Add regression basics py file
Browse files
- app.py +467 -0
- requirements.txt +5 -0
app.py
ADDED
@@ -0,0 +1,467 @@
import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_validate, learning_curve
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.datasets import fetch_california_housing
import time


def show_intro():
    with st.expander("➡️ What is Regression?"):
        st.markdown("""
**Regression** is a fundamental statistical technique used to understand and quantify the relationship between a **dependent variable (what you want to predict)** and one or more **independent variables (predictors)**.

---
###### 🔍 Everyday Examples of Regression:
- 📈 Predicting **house prices** based on size, location, and number of bedrooms.
- 🎓 Estimating a student's **final grade** based on hours of study and attendance.
- 🚗 Forecasting **fuel efficiency** based on engine size and weight of the car.
- 🧠 Predicting **IQ scores** or **height** based on parental traits (enter Galton! 👇)

---

###### 👨‍👩‍👧‍👦 Galton's Theory – *Regression to the Mean*
Sir Francis Galton, a 19th-century statistician and cousin of Charles Darwin, studied the heights of parents and their children.

He observed:
- Very tall parents tended to have children **shorter** than themselves.
- Very short parents tended to have children **taller** than themselves.

🧠 He coined the term **"regression to the mean"**, which means:
> "Extreme traits tend to be followed by traits closer to the average in the next generation."

---
###### 👶 Real-Life Example:
- If both parents are exceptionally tall (say, 6'5"), their child is **likely tall**, but **closer to the average height** than the parents — maybe 6'2".
- Similarly, if parents are very short, the child's height tends to "regress" toward the average population height.

This pattern **doesn't mean height is random**, just that genetics and environment **pull traits toward typical values** over time.

---
Regression models in ML extend this idea — instead of modeling parent-child height, we model **any continuous outcome** based on relevant input variables.
        """)

    with st.expander("➡️ Industry Use-Cases of Regression Models"):
        st.markdown("""
###### 🏥 Healthcare
- 🔬 Estimating **patient recovery time** based on age, treatment type, and initial condition.
- 💉 Predicting **blood glucose levels** based on dietary habits and medication dosage.
- 🫀 Forecasting **hospital readmission rates** based on prior health records and discharge details.

###### 🛒 Retail
- 📦 Predicting **sales volume** based on pricing, seasonality, and promotional campaigns.
- 🛍️ Estimating **inventory demand** for specific SKUs using historical sales and trends.
- 👗 Forecasting **customer churn** likelihood using past purchase behavior and returns.

###### 🛍️ E-commerce
- 💸 Predicting **customer lifetime value (CLV)** based on purchase frequency and basket size.
- 🚚 Estimating **delivery time** based on warehouse location, item type, and order volume.
- 🧾 Forecasting **return probability** of products based on description, images, and reviews.

###### 💰 Finance
- 📊 Predicting **stock prices** or **bond yields** based on historical trends and market indicators.
- 🏦 Estimating **credit risk** or **loan default probability** using income, credit history, etc.
- 💳 Forecasting **spending patterns** on credit cards based on customer behavior.

###### 💊 Pharma & Life Sciences
- 🧪 Predicting **drug efficacy** based on dosage and patient demographics in clinical trials.
- 🦠 Estimating **disease progression** timelines based on early symptoms and test results.
- 💊 Forecasting **adverse drug reactions** from formulation and patient profiles.
        """)


def simple_regression_example():
    with st.expander("➡️ Single Variable Regression (Manual Calculation)"):
        # Sample data
        advertising_spend = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
        sales_revenue = np.array([2.1, 3.9, 5.2, 6.0, 7.1, 8.1, 9.0, 10.2, 10.8, 12.0])

        # Regression coefficients (manual calculation)
        x_mean = np.mean(advertising_spend)
        y_mean = np.mean(sales_revenue)
        b1 = np.sum((advertising_spend - x_mean) * (sales_revenue - y_mean)) / np.sum((advertising_spend - x_mean)**2)
        b0 = y_mean - b1 * x_mean
        predicted_sales = b0 + b1 * advertising_spend
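
        # Sanity check (illustrative addition, not in the original app): the closed-form
        # least-squares estimates above should match NumPy's degree-1 polynomial fit.
        slope_chk, intercept_chk = np.polyfit(advertising_spend, sales_revenue, 1)
        assert np.isclose(b1, slope_chk) and np.isclose(b0, intercept_chk)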

        # Two-column layout
        col1, col2 = st.columns(2)

        with col1:
            st.markdown("###### 📊 Sample Data")
            df = pd.DataFrame({
                'Advertising Spend (in lakhs)': advertising_spend,
                'Sales Revenue (in lakhs)': sales_revenue
            })
            st.dataframe(df)

        with col2:
            st.markdown("###### 📉 Linear Regression Formula")
            st.markdown(f"""
The linear regression equation is:
**Sales Revenue = {b0:.2f} + {b1:.2f} × Advertising Spend**

Where:
- **b₀ (Intercept)**: Sales revenue when advertising spend is zero.
- **b₁ (Slope)**: Increase in revenue for each additional lakh spent.

###### Formula for Computing Coefficients
- **b₁ (Slope)** = (Σ(xᵢ - x̄)(yᵢ - ȳ)) / Σ(xᵢ - x̄)²
- **b₀ (Intercept)** = ȳ - b₁ × x̄
            """)

        # Plotting the regression line
        fig, ax = plt.subplots(figsize=(9, 4))
        ax.scatter(advertising_spend, sales_revenue, color='blue', label='Actual')
        ax.plot(advertising_spend, predicted_sales, color='red', label='Fitted Line')
        ax.set_xlabel("Advertising Spend (in lakhs)", fontsize=10)
        ax.set_ylabel("Sales Revenue (in lakhs)", fontsize=10)
        ax.set_title("Linear Regression: Advertising Spend vs Sales Revenue", fontsize=10)
        ax.tick_params(axis='both', labelsize=8)
        ax.legend()
        st.pyplot(fig)

    with st.expander("➡️ Predict Sales Revenue from Advertising Spend"):
        st.markdown("Use the trained regression model to forecast expected sales revenue 📈")

        user_input = st.number_input(
            "Enter Advertising Spend (in lakhs)",
            min_value=1.0,
            max_value=20.0,
            value=5.0,
            step=0.5,
            format="%.1f"
        )

        if user_input:
            predicted_value = b0 + b1 * user_input
            st.success(f"🔮 Predicted Sales Revenue: **{predicted_value:.2f} lakhs**")

            # Visualize prediction on the regression chart
            fig, ax = plt.subplots(figsize=(9, 4))
            ax.scatter(advertising_spend, sales_revenue, color='blue', label='Actual')
            ax.plot(advertising_spend, predicted_sales, color='red', label='Fitted Line')

            # Add dashed lines for prediction
            ax.axvline(x=user_input, color='red', linestyle='--', linewidth=1)
            ax.axhline(y=predicted_value, color='red', linestyle='--', linewidth=1)
            ax.plot(user_input, predicted_value, 'ro')  # predicted point

            ax.set_xlabel("Advertising Spend (in lakhs)", fontsize=10)
            ax.set_ylabel("Sales Revenue (in lakhs)", fontsize=10)
            ax.set_title("Prediction on Regression Line", fontsize=10)
            ax.tick_params(axis='both', labelsize=8)
            ax.legend()
            st.pyplot(fig)

    with st.expander("➡️ Key Takeaways ..."):
        st.markdown("""
- 🔍 **Simplicity with Impact**: Even a simple linear model offers valuable foresight—linking investments (like ad spend) directly to outcomes (like sales revenue).
- 📊 **Data-Driven Decisions**: Enables leadership to make **objective** decisions, backed by quantitative evidence rather than gut feel.
- 🎯 **Budget Optimization**: Helps identify how much to invest to hit revenue targets—minimizing under- or over-spending on campaigns.
- 📈 **Trend Insights**: Shows whether returns from increased spending are **linear**, diminishing, or plateauing over time.
- 🧪 **Foundation for More Advanced Models**: This simple regression builds the base for multivariable models involving seasonality, regions, or digital channels.
        """)


def load_ca_data():
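    # fetch_california_housing downloads (and caches) the 20,640-row 1990 census dataset;
    # as_frame=True exposes it as a pandas DataFrame via data.frame.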
    data = fetch_california_housing(as_frame=True)
    X = data.frame.drop(['MedHouseVal'], axis=1)
    y = data.frame['MedHouseVal']
    return data.frame, X, y


def vif_check(df):
    X = df.drop(columns=['MedHouseVal'])
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
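    # Note: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the
    # remaining features; values above roughly 5-10 are commonly read as multicollinearity.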
    return vif_data, X, df['MedHouseVal']


def build_model(X_train, y_train):
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model


def main():
    st.markdown("**🧠 Regression Intuitions - Linear Regression Demo**")

    show_intro()

    simple_regression_example()

    with st.expander("➡️ Load & View California Housing Dataset"):
        df, X, y = load_ca_data()
        st.dataframe(df.head())

        st.markdown("""
The **California Housing Dataset** is based on data from the 1990 U.S. Census.
It contains information collected from block groups across California and is often used for regression tasks to predict housing values.

###### 📌 Columns Description:

- **MedInc** *(Median Income)*: Median income of households in the block group (in tens of thousands of dollars).
- **HouseAge** *(Median House Age)*: Median age of houses in the area.
- **AveRooms** *(Average Rooms)*: Average number of rooms per household.
- **AveBedrms** *(Average Bedrooms)*: Average number of bedrooms per household.
- **Population**: Total population of the block group.
- **AveOccup** *(Average Occupancy)*: Average number of people per household.
- **Latitude**: Geographical latitude of the block group.
- **Longitude**: Geographical longitude of the block group.

###### 🎯 Target Column:
- **MedHouseVal** *(Median House Value)*: This is the target variable to be predicted.
  It represents the **median house value** in the block group (in hundreds of thousands of dollars).
        """)

        st.markdown("###### 🗺️ California Housing: Prices by Location")

        fig, ax = plt.subplots(figsize=(12, 5))
        scatter = ax.scatter(
            df["Longitude"],
            df["Latitude"],
            c=df["MedHouseVal"],
            cmap="viridis",
            s=10,
            alpha=0.5
        )

        ax.set_title("Median House Value across California", fontsize=14)
        ax.set_xlabel("Longitude")
        ax.set_ylabel("Latitude")
        ax.grid(True)

        # Add color bar to represent house value
        cbar = plt.colorbar(scatter, ax=ax)
        cbar.set_label("Median House Value ($100,000s)")

        # Annotate major cities
        ax.annotate("Los Angeles", xy=(-118.25, 34.05), xytext=(-121, 33.8),
                    arrowprops=dict(facecolor='red', arrowstyle="->"), fontsize=10, color='red')
        ax.annotate("San Francisco", xy=(-122.42, 37.77), xytext=(-125, 38.5),
                    arrowprops=dict(facecolor='blue', arrowstyle="->"), fontsize=10, color='blue')

        # Shade ocean region (rough approximation: west of longitude -123)
        ax.axvspan(-125, -123, color='lightblue', alpha=0.3, label="Pacific Ocean")

        # Add legend
        ax.legend(loc="lower right")

        st.pyplot(fig)

        st.write("""
- Color represents housing value: darker → cheaper, lighter → more expensive.
- Notice the high-value clusters around coastal regions (e.g., around the Bay Area and Los Angeles).
        """)

    with st.expander("➡️ Key Challenges of California Housing Dataset (Regression vs Rule-Based Models)"):
        st.markdown("""
Understanding the limitations of both data and modeling approaches is vital for leaders making data-driven decisions. Below are the key challenges when using this dataset for **regression modeling**, especially compared to traditional **rule-based systems**:

###### 🔍 Data Challenges (Specific to Regression):
- **Non-linear Relationships**: Housing prices may not increase proportionally with income, age, or other features, making simple linear models insufficient.
- **Geographic Bias**: Locations like LA and SF have unique dynamics not captured by standard features—housing is expensive due to factors beyond income or age.
- **Data Outliers**: Some neighborhoods may have unusually high or low prices, skewing the model's predictions.
- **Capped Target Values**: `MedHouseVal` was capped at $500,000 in the dataset, which can limit the model's ability to predict higher-end housing.

###### 🤖 Compared to Rule-Based Models:
- **Rule-based systems lack adaptability**: Rules like "if income > X, price > Y" cannot account for regional nuances, housing density, or socio-economic patterns.
- **Hard to scale**: Adding new rules for every edge case becomes complex and unmanageable over time.
- **Not data-driven**: Rule-based logic does not improve from historical data or learn from new patterns.

###### 🧭 Key Takeaway:
> Regression models offer adaptability and learning from patterns across vast geographies and populations. However, they require clean, unbiased data and continuous validation—unlike rule-based systems, which are simple but brittle and not future-proof.
        """)
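        # Illustrative aside (assumption, not in the original app): the price cap shows up
        # as a spike at MedHouseVal ≈ 5.0 ($100k units). The share of capped block groups
        # could be surfaced with, e.g.:
        #   st.caption(f"Share of capped rows: {(df['MedHouseVal'] >= 5).mean():.1%}")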

    # with st.expander("➡️ Linearity Check & VIF"):
    #     vif_data, X, y = vif_check(df)
    #     st.dataframe(vif_data)

    with st.expander("➡️ Prepare Data for the regression model"):

        st.markdown("""
Creating training and test datasets is a fundamental step in building machine learning models. It ensures the model learns patterns **only from part of the data**, and is then **evaluated on unseen data** to measure its performance.

###### 🔧 Why Prepare Data?
- **Ensures Model Quality**: Models need structured and clean data to learn effectively.
- **Prevents Overfitting**: By separating training from testing, we prevent the model from simply memorizing the data.
- **Enables Generalization**: A well-prepared dataset ensures the model can make accurate predictions on new, real-world data.

###### 📦 Train-Test Split
- **Training Set**: Used by the model to learn patterns and relationships between input (features) and output (target).
- **Test Set**: Held back during training and used solely to evaluate model performance. It simulates how the model would perform in production.

###### ✅ Best Practices
- **Use an 80/20 or 70/30 split** depending on dataset size.
- **Stratify** if your target variable is imbalanced (more applicable in classification).
- **Set a random seed** (e.g., `random_state=42`) for reproducibility.
- **Clean and preprocess** before splitting to avoid data leakage.
- **Avoid using test data during model training or tuning**—this ensures an unbiased evaluation.

> 🔍 **Key point**: Proper data preparation is like setting the foundation of a building—without it, even the most advanced models can crumble in production.
        """)

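        # 80/20 split (test_size=0.2) with a fixed seed (random_state=42) so the split,
        # and hence the reported metrics, stay reproducible, per the best practices above.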
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Display the number of samples in training and testing sets
        st.write(f"Number of samples in training set: {X_train.shape[0]}")
        st.write(f"Number of samples in testing set: {X_test.shape[0]}")
        st.write("Train and test sets created.")

    with st.expander("➡️ Build Linear Regression Model"):
        # model = build_model(X_train, y_train)

        st.write("Training the Linear Regression model...")

        # Simulate training with progress bar
        progress_bar = st.progress(0)
        for i in range(100):
            time.sleep(0.01)
            progress_bar.progress(i + 1)

        # Train the model
        model = LinearRegression()
        model.fit(X_train, y_train)
        st.success("Model trained successfully.")

        # Predict and compute metrics
        y_pred = model.predict(X_test)
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)
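        # For reference: MAE = mean(|y - ŷ|), MSE = mean((y - ŷ)²), RMSE = √MSE, and
        # R² = 1 - SS_res / SS_tot, i.e. the share of variance in y explained by the model.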

        st.markdown("###### 📊 Model Evaluation on Test Set")
        st.write(f"**MAE**: {mae:.2f}")
        st.write(f"**MSE**: {mse:.2f}")
        st.write(f"**RMSE**: {rmse:.2f}")
        st.write(f"**R² Score**: {r2:.2f}")

        # Cross-validation to detect overfitting
        st.markdown("###### 🔁 Cross-Validation Performance")
        cv_results = cross_validate(model, X, y, cv=10, return_train_score=True, scoring='r2')
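        # 10-fold cross-validation over the full dataset; return_train_score=True also keeps
        # each fold's training R² so it can be compared with the held-out fold's R² below.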
        train_r2 = cv_results['train_score']
        test_r2 = cv_results['test_score']

        r2_df = pd.DataFrame({
            'Fold': list(range(1, 11)),
            'Training R²': train_r2,
            'Test R²': test_r2
        })

        fig, ax = plt.subplots(figsize=(9, 5))
        ax.plot(r2_df['Fold'], r2_df['Training R²'], marker='o', label='Training R²', color='blue')
        ax.plot(r2_df['Fold'], r2_df['Test R²'], marker='o', label='Test R²', color='green')
        ax.set_title("Cross-Validation R² Scores")
        ax.set_xlabel("Fold")
        ax.set_ylabel("R² Score")
        ax.legend()
        ax.grid(True)
        st.pyplot(fig)

        st.dataframe(r2_df.style.format({'Training R²': '{:.2f}', 'Test R²': '{:.2f}'}))

        st.write("""
- ✅ **Consistent Training Performance**:
  The training R² scores range from **0.59 to 0.63**, indicating a fairly **consistent learning pattern** across all 10 folds.
  This means the model generalizes reasonably well on the training data.

- ⚠️ **Test Set Variability**:
  The test R² scores range from **0.42 to 0.61**, showing **slightly higher variance** across folds.
  Some folds show strong performance (e.g., Fold 2), while others drop noticeably (e.g., Fold 3).

- 🔁 **No Severe Overfitting Detected**:
  If the training R² were very high (e.g., 0.9) and the test R² low (e.g., 0.3), that would indicate **overfitting**.
  In this case, **training and test R² are fairly close**, suggesting the model is **not overfitting significantly**.

- 📉 **Room for Improvement**:
  An average test R² around **0.52** implies that the model explains **just over 50% of the variance** in house values.
  For business-critical applications like real estate pricing or policy decisions, we may consider:
  - **Feature engineering** (e.g., regional segmentation),
  - **Model tuning**, or
  - **Trying more expressive models** like decision trees or gradient boosting.
        """)

    # learning curve
    with st.expander("➡️ Was Training Data Sufficient? (Learning Curve Analysis)"):
        st.markdown("###### 📊 Learning Curve Analysis")

        # Generate learning curves
        train_sizes, train_scores, test_scores = learning_curve(
            model, X, y, cv=5, scoring='r2', train_sizes=np.linspace(0.1, 1.0, 10), shuffle=True, random_state=42
        )
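        # learning_curve refits the model on 10 increasing training-set sizes (10% to 100%
        # of the available data) under 5-fold CV, returning R² for training and validation folds.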

        # Calculate mean scores across CV folds
        train_scores_mean = np.mean(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)

        # Plotting
        fig, ax = plt.subplots(figsize=(9, 4))
        ax.plot(train_sizes, train_scores_mean, 'o-', color="blue", label="Training R²")
        ax.plot(train_sizes, test_scores_mean, 'o-', color="green", label="Validation R²")
        ax.set_title("Learning Curve: Linear Regression")
        ax.set_xlabel("Number of Training Samples")
        ax.set_ylabel("R² Score")
        ax.legend(loc="best")
        ax.grid(True)
        st.pyplot(fig)

        # Interpret results
        st.write("""
- ✅ **Training R² is high initially** (indicating the model learns patterns even with fewer samples).
- 📉 **Validation R² improves as training size increases**, then plateaus.
- 🧠 This suggests the model **benefits from more training data**, but after a certain point, **additional data does not significantly improve generalization**.
- 🔍 The **gap between training and validation curves** is relatively small, indicating **no severe overfitting**.
- 📌 **Conclusion**: The current dataset size seems **adequate**, and the model is learning well with the data provided.
        """)

    with st.expander("📊 Understand Feature Impact: Coefficients of the Linear Regression Model"):
        importance = model.coef_
        features = X.columns

        fig, ax = plt.subplots(figsize=(9, 5))
        ax.barh(features, importance, color='skyblue')
        ax.set_title("Feature Importance (Linear Regression Coefficients)")
        ax.set_xlabel("Coefficient Value")
        st.pyplot(fig)
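        # Caveat (illustrative note, not in the original app): raw coefficients carry the
        # units of their features, so absolute sizes are only directly comparable when the
        # features share a scale. A rough scale-adjusted view multiplies each coefficient
        # by its feature's standard deviation, e.g.:
        #   scaled_importance = model.coef_ * X_train.std().values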

        st.markdown("""
###### 🔍 Interpretation:
- Features with larger **absolute values** have a stronger effect on the predicted house value.
- A **positive coefficient** increases the predicted value.
- A **negative coefficient** decreases the predicted value.

###### 🧠 What it means for decision-makers:
- **Median Income** is a strong positive driver — wealthier areas tend to have higher housing values.
- **Latitude** has a negative coefficient — northern areas may have lower house prices.
- Helps focus strategic decisions on what really influences prices across California.
        """)

    with st.expander("🧠 Why Linear Regression Still Matters: Foundation for Deep Learning & Transformers"):
        st.markdown("""
Linear Regression may look simple, but it's far from trivial — it's the **first rung on the ladder** to advanced AI models like **Deep Learning** and **Transformers**.

###### 📚 Conceptual Foundations:
- **Weights & Bias**: The core of linear regression is learning weights and a bias — which is exactly what **every neural network layer** does, just at scale.
- **Loss Minimization**: Linear regression minimizes **Mean Squared Error** — the same principle used when training neural networks, whose weights are adjusted through **backpropagation**.
- **Linear Combinations**: Deep learning models, at their core, are just multiple layers of **linear transformations + non-linear activations**.

###### 🤖 Connection to Transformers:
- Transformer architectures (like GPT, BERT) use **linear projections** in their attention mechanisms.
- Every layer in these models performs matrix multiplications — which is, again, just advanced **linear algebra and regression-like operations**.

###### 🏗️ Strategic Insight:
- A solid grasp of linear regression builds the intuition needed to understand more complex systems.
- Senior leaders can better evaluate ML and AI project feasibility and interpret outcomes by understanding these **fundamentals that scale**.

🔄 *"From Linear Regression to Transformers, it's all about modeling relationships and optimizing parameters — just with different levels of complexity and abstraction."*
        """)
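        # Illustrative aside (not part of the original app): a single dense layer computes
        # the same linear form as regression, y = X @ W + b, followed by a non-linear
        # activation, e.g.
        #   hidden = np.maximum(0, X @ W + b)   # ReLU applied to a linear transformation
        # Stacking such layers (and the linear projections inside attention) is what deep
        # networks and Transformers are built from.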


if __name__ == "__main__":
    main()
requirements.txt
ADDED
@@ -0,0 +1,5 @@
streamlit
scikit-learn
pandas
seaborn
matplotlib
statsmodels  # required by app.py (variance_inflation_factor import)