Bhupen committed · Commit ce707b9 · 1 parent: 53b25e6

Add regression basics py file
Browse files
- app.py +467 -0
- requirements.txt +5 -0
app.py
ADDED
@@ -0,0 +1,467 @@
import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_validate, learning_curve
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.datasets import fetch_california_housing
import time


def show_intro():
    with st.expander("➡️ What is Regression?"):
        st.markdown("""
**Regression** is a fundamental statistical technique used to understand and quantify the relationship between a **dependent variable (what you want to predict)** and one or more **independent variables (predictors)**.

---
###### 🔍 Everyday Examples of Regression:
- 📈 Predicting **house prices** based on size, location, and number of bedrooms.
- 🎓 Estimating a student's **final grade** based on hours of study and attendance.
- 🚗 Forecasting **fuel efficiency** based on engine size and weight of the car.
- 🧠 Predicting **IQ scores** or **height** based on parental traits (enter Galton! 👇)

---

###### 👨‍👩‍👧‍👦 Galton's Theory – *Regression to the Mean*
Sir Francis Galton, a 19th-century statistician and cousin of Charles Darwin, studied the heights of parents and their children.

He observed:
- Very tall parents tended to have children **shorter** than themselves.
- Very short parents tended to have children **taller** than themselves.

🧠 He coined the term **"regression to the mean"**, which means:
> "Extreme traits tend to be followed by traits closer to the average in the next generation."

---
###### 👶 Real-Life Example:
- If both parents are exceptionally tall (say, 6'5"), their child is **likely tall**, but **closer to the average height** than the parents — maybe 6'2".
- Similarly, if parents are very short, the child's height tends to "regress" toward the average population height.

This pattern **doesn't mean height is random**, just that genetics and environment **pull traits toward typical values** over time.

---
Regression models in ML extend this idea — instead of modeling parent-child height, we model **any continuous outcome** based on relevant input variables.
        """)

    with st.expander("➡️ Industry Use-Cases of Regression Models"):
        st.markdown("""
###### 🏥 Healthcare
- 🔬 Estimating **patient recovery time** based on age, treatment type, and initial condition.
- 💉 Predicting **blood glucose levels** based on dietary habits and medication dosage.
- 🫀 Forecasting **hospital readmission rates** based on prior health records and discharge details.

###### 🛒 Retail
- 📦 Predicting **sales volume** based on pricing, seasonality, and promotional campaigns.
- 🛍️ Estimating **inventory demand** for specific SKUs using historical sales and trends.
- 👗 Forecasting **customer churn** likelihood using past purchase behavior and returns.

###### 🛍️ E-commerce
- 💸 Predicting **customer lifetime value (CLV)** based on purchase frequency and basket size.
- 🚚 Estimating **delivery time** based on warehouse location, item type, and order volume.
- 🧾 Forecasting **return probability** of products based on description, images, and reviews.

###### 💰 Finance
- 📊 Predicting **stock prices** or **bond yields** based on historical trends and market indicators.
- 🏦 Estimating **credit risk** or **loan default probability** using income, credit history, etc.
- 💳 Forecasting **spending patterns** on credit cards based on customer behavior.

###### 💊 Pharma & Life Sciences
- 🧪 Predicting **drug efficacy** based on dosage and patient demographics in clinical trials.
- 🦠 Estimating **disease progression** timelines based on early symptoms and test results.
- 💊 Forecasting **adverse drug reactions** from formulation and patient profiles.
        """)


def simple_regression_example():
    with st.expander("➡️ Single Variable Regression (Manual Calculation)"):
        # Sample data
        advertising_spend = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
        sales_revenue = np.array([2.1, 3.9, 5.2, 6.0, 7.1, 8.1, 9.0, 10.2, 10.8, 12.0])

        # Regression coefficients (manual calculation)
        x_mean = np.mean(advertising_spend)
        y_mean = np.mean(sales_revenue)
        b1 = np.sum((advertising_spend - x_mean) * (sales_revenue - y_mean)) / np.sum((advertising_spend - x_mean)**2)
        b0 = y_mean - b1 * x_mean
        predicted_sales = b0 + b1 * advertising_spend
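
        # Sanity check (illustrative addition, not in the original app): the closed-form
        # least-squares estimates above should match NumPy's degree-1 polynomial fit.
        slope_chk, intercept_chk = np.polyfit(advertising_spend, sales_revenue, 1)
        assert np.isclose(b1, slope_chk) and np.isclose(b0, intercept_chk)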

        # Two-column layout
        col1, col2 = st.columns(2)

        with col1:
            st.markdown("###### 📊 Sample Data")
            df = pd.DataFrame({
                'Advertising Spend (in lakhs)': advertising_spend,
                'Sales Revenue (in lakhs)': sales_revenue
            })
            st.dataframe(df)

        with col2:
            st.markdown("###### 📉 Linear Regression Formula")
            st.markdown(f"""
The linear regression equation is:
**Sales Revenue = {b0:.2f} + {b1:.2f} × Advertising Spend**

Where:
- **b₀ (Intercept)**: Sales revenue when advertising spend is zero.
- **b₁ (Slope)**: Increase in revenue for each additional lakh spent.

###### Formula for Computing Coefficients
- **b₁ (Slope)** = (Σ(xᵢ - x̄)(yᵢ - ȳ)) / Σ(xᵢ - x̄)²
- **b₀ (Intercept)** = ȳ - b₁ × x̄
            """)

        # Plotting the regression line
        fig, ax = plt.subplots(figsize=(9, 4))
        ax.scatter(advertising_spend, sales_revenue, color='blue', label='Actual')
        ax.plot(advertising_spend, predicted_sales, color='red', label='Fitted Line')
        ax.set_xlabel("Advertising Spend (in lakhs)", fontsize=10)
        ax.set_ylabel("Sales Revenue (in lakhs)", fontsize=10)
        ax.set_title("Linear Regression: Advertising Spend vs Sales Revenue", fontsize=10)
        ax.tick_params(axis='both', labelsize=8)
        ax.legend()
        st.pyplot(fig)

    with st.expander("➡️ Predict Sales Revenue from Advertising Spend"):
        st.markdown("Use the trained regression model to forecast expected sales revenue 📈")

        user_input = st.number_input(
            "Enter Advertising Spend (in lakhs)",
            min_value=1.0,
            max_value=20.0,
            value=5.0,
            step=0.5,
            format="%.1f"
        )

        if user_input:
            predicted_value = b0 + b1 * user_input
            st.success(f"🔮 Predicted Sales Revenue: **{predicted_value:.2f} lakhs**")

            # Visualize prediction on the regression chart
            fig, ax = plt.subplots(figsize=(9, 4))
            ax.scatter(advertising_spend, sales_revenue, color='blue', label='Actual')
            ax.plot(advertising_spend, predicted_sales, color='red', label='Fitted Line')

            # Add dashed lines for prediction
            ax.axvline(x=user_input, color='red', linestyle='--', linewidth=1)
            ax.axhline(y=predicted_value, color='red', linestyle='--', linewidth=1)
            ax.plot(user_input, predicted_value, 'ro')  # predicted point

            ax.set_xlabel("Advertising Spend (in lakhs)", fontsize=10)
            ax.set_ylabel("Sales Revenue (in lakhs)", fontsize=10)
            ax.set_title("Prediction on Regression Line", fontsize=10)
            ax.tick_params(axis='both', labelsize=8)
            ax.legend()
            st.pyplot(fig)

    with st.expander("➡️ Key Takeaways ..."):
        st.markdown("""
- 🔍 **Simplicity with Impact**: Even a simple linear model offers valuable foresight—linking investments (like ad spend) directly to outcomes (like sales revenue).
- 📊 **Data-Driven Decisions**: Enables leadership to make **objective** decisions, backed by quantitative evidence rather than gut feel.
- 🎯 **Budget Optimization**: Helps identify how much to invest to hit revenue targets—minimizing under- or over-spending on campaigns.
- 📈 **Trend Insights**: Shows whether returns from increased spending are **linear**, diminishing, or plateauing over time.
- 🧪 **Foundation for More Advanced Models**: This simple regression builds the base for multivariable models involving seasonality, regions, or digital channels.
        """)


def load_ca_data():
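    # fetch_california_housing downloads (and caches) the 20,640-row 1990 census dataset;
    # as_frame=True exposes it as a pandas DataFrame via data.frame.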
    data = fetch_california_housing(as_frame=True)
    X = data.frame.drop(['MedHouseVal'], axis=1)
    y = data.frame['MedHouseVal']
    return data.frame, X, y


def vif_check(df):
    X = df.drop(columns=['MedHouseVal'])
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
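    # Note: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the
    # remaining features; values above roughly 5-10 are commonly read as multicollinearity.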
    return vif_data, X, df['MedHouseVal']


def build_model(X_train, y_train):
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model


def main():
    st.markdown("**🧠 Regression Intuitions - Linear Regression Demo**")

    show_intro()

    simple_regression_example()

    with st.expander("➡️ Load & View California Housing Dataset"):
        df, X, y = load_ca_data()
        st.dataframe(df.head())

        st.markdown("""
The **California Housing Dataset** is based on data from the 1990 U.S. Census.
It contains information collected from block groups across California and is often used for regression tasks to predict housing values.

###### 📌 Columns Description:

- **MedInc** *(Median Income)*: Median income of households in the block group (in tens of thousands of dollars).
- **HouseAge** *(Median House Age)*: Median age of houses in the area.
- **AveRooms** *(Average Rooms)*: Average number of rooms per household.
- **AveBedrms** *(Average Bedrooms)*: Average number of bedrooms per household.
- **Population**: Total population of the block group.
- **AveOccup** *(Average Occupancy)*: Average number of people per household.
- **Latitude**: Geographical latitude of the block group.
- **Longitude**: Geographical longitude of the block group.

###### 🎯 Target Column:
- **MedHouseVal** *(Median House Value)*: This is the target variable to be predicted.
  It represents the **median house value** in the block group (in hundreds of thousands of dollars).
        """)

        st.markdown("###### 🗺️ California Housing: Prices by Location")

        fig, ax = plt.subplots(figsize=(12, 5))
        scatter = ax.scatter(
            df["Longitude"],
            df["Latitude"],
            c=df["MedHouseVal"],
            cmap="viridis",
            s=10,
            alpha=0.5
        )

        ax.set_title("Median House Value across California", fontsize=14)
        ax.set_xlabel("Longitude")
        ax.set_ylabel("Latitude")
        ax.grid(True)

        # Add color bar to represent house value
        cbar = plt.colorbar(scatter, ax=ax)
        cbar.set_label("Median House Value ($100,000s)")

        # Annotate major cities
        ax.annotate("Los Angeles", xy=(-118.25, 34.05), xytext=(-121, 33.8),
                    arrowprops=dict(facecolor='red', arrowstyle="->"), fontsize=10, color='red')
        ax.annotate("San Francisco", xy=(-122.42, 37.77), xytext=(-125, 38.5),
                    arrowprops=dict(facecolor='blue', arrowstyle="->"), fontsize=10, color='blue')

        # Shade ocean region (rough approximation: west of longitude -123)
        ax.axvspan(-125, -123, color='lightblue', alpha=0.3, label="Pacific Ocean")

        # Add legend
        ax.legend(loc="lower right")

        st.pyplot(fig)

        st.write("""
- Color represents housing value: darker → cheaper, lighter → more expensive.
- Notice the high-value clusters around coastal regions (e.g., around the Bay Area and Los Angeles).
        """)

    with st.expander("➡️ Key Challenges of California Housing Dataset (Regression vs Rule-Based Models)"):
        st.markdown("""
Understanding the limitations of both data and modeling approaches is vital for leaders making data-driven decisions. Below are the key challenges when using this dataset for **regression modeling**, especially compared to traditional **rule-based systems**:

###### 🔍 Data Challenges (Specific to Regression):
- **Non-linear Relationships**: Housing prices may not increase proportionally with income, age, or other features, making simple linear models insufficient.
- **Geographic Bias**: Locations like LA and SF have unique dynamics not captured by standard features—housing is expensive due to factors beyond income or age.
- **Data Outliers**: Some neighborhoods may have unusually high or low prices, skewing the model's predictions.
- **Capped Target Values**: `MedHouseVal` was capped at $500,000 in the dataset, which can limit the model's ability to predict higher-end housing.

###### 🤖 Compared to Rule-Based Models:
- **Rule-based systems lack adaptability**: Rules like "if income > X, price > Y" cannot account for regional nuances, housing density, or socio-economic patterns.
- **Hard to scale**: Adding new rules for every edge case becomes complex and unmanageable over time.
- **Not data-driven**: Rule-based logic does not improve from historical data or learn from new patterns.

###### 🧭 Key Takeaway:
> Regression models offer adaptability and learning from patterns across vast geographies and populations. However, they require clean, unbiased data and continuous validation—unlike rule-based systems, which are simple but brittle and not future-proof.
        """)
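        # Illustrative aside (assumption, not in the original app): the price cap shows up
        # as a spike at MedHouseVal ≈ 5.0 ($100k units). The share of capped block groups
        # could be surfaced with, e.g.:
        #   st.caption(f"Share of capped rows: {(df['MedHouseVal'] >= 5).mean():.1%}")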

    # with st.expander("➡️ Linearity Check & VIF"):
    #     vif_data, X, y = vif_check(df)
    #     st.dataframe(vif_data)

    with st.expander("➡️ Prepare Data for the regression model"):

        st.markdown("""
Creating training and test datasets is a fundamental step in building machine learning models. It ensures the model learns patterns **only from part of the data**, and is then **evaluated on unseen data** to measure its performance.

###### 🔧 Why Prepare Data?
- **Ensures Model Quality**: Models need structured and clean data to learn effectively.
- **Prevents Overfitting**: By separating training from testing, we prevent the model from simply memorizing the data.
- **Enables Generalization**: A well-prepared dataset ensures the model can make accurate predictions on new, real-world data.

###### 📦 Train-Test Split
- **Training Set**: Used by the model to learn patterns and relationships between input (features) and output (target).
- **Test Set**: Held back during training and used solely to evaluate model performance. It simulates how the model would perform in production.

###### ✅ Best Practices
- **Use an 80/20 or 70/30 split** depending on dataset size.
- **Stratify** if your target variable is imbalanced (more applicable in classification).
- **Set a random seed** (e.g., `random_state=42`) for reproducibility.
- **Clean and preprocess** before splitting to avoid data leakage.
- **Avoid using test data during model training or tuning**—this ensures an unbiased evaluation.

> 🔍 **Key point**: Proper data preparation is like setting the foundation of a building—without it, even the most advanced models can crumble in production.
        """)

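        # 80/20 split (test_size=0.2) with a fixed seed (random_state=42) so the split,
        # and hence the reported metrics, stay reproducible, per the best practices above.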
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Display the number of samples in training and testing sets
        st.write(f"Number of samples in training set: {X_train.shape[0]}")
        st.write(f"Number of samples in testing set: {X_test.shape[0]}")
        st.write("Train and test sets created.")

    with st.expander("➡️ Build Linear Regression Model"):
        # model = build_model(X_train, y_train)

        st.write("Training the Linear Regression model...")

        # Simulate training with progress bar
        progress_bar = st.progress(0)
        for i in range(100):
            time.sleep(0.01)
            progress_bar.progress(i + 1)

        # Train the model
        model = LinearRegression()
        model.fit(X_train, y_train)
        st.success("Model trained successfully.")

        # Predict and compute metrics
        y_pred = model.predict(X_test)
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)
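        # For reference: MAE = mean(|y - ŷ|), MSE = mean((y - ŷ)²), RMSE = √MSE, and
        # R² = 1 - SS_res / SS_tot, i.e. the share of variance in y explained by the model.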

        st.markdown("###### 📊 Model Evaluation on Test Set")
        st.write(f"**MAE**: {mae:.2f}")
        st.write(f"**MSE**: {mse:.2f}")
        st.write(f"**RMSE**: {rmse:.2f}")
        st.write(f"**R² Score**: {r2:.2f}")

        # Cross-validation to detect overfitting
        st.markdown("###### 🔁 Cross-Validation Performance")
        cv_results = cross_validate(model, X, y, cv=10, return_train_score=True, scoring='r2')
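        # 10-fold cross-validation over the full dataset; return_train_score=True also keeps
        # each fold's training R² so it can be compared with the held-out fold's R² below.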
        train_r2 = cv_results['train_score']
        test_r2 = cv_results['test_score']

        r2_df = pd.DataFrame({
            'Fold': list(range(1, 11)),
            'Training R²': train_r2,
            'Test R²': test_r2
        })

        fig, ax = plt.subplots(figsize=(9, 5))
        ax.plot(r2_df['Fold'], r2_df['Training R²'], marker='o', label='Training R²', color='blue')
        ax.plot(r2_df['Fold'], r2_df['Test R²'], marker='o', label='Test R²', color='green')
        ax.set_title("Cross-Validation R² Scores")
        ax.set_xlabel("Fold")
        ax.set_ylabel("R² Score")
        ax.legend()
        ax.grid(True)
        st.pyplot(fig)

        st.dataframe(r2_df.style.format({'Training R²': '{:.2f}', 'Test R²': '{:.2f}'}))

        st.write("""
- ✅ **Consistent Training Performance**:
  The training R² scores range from **0.59 to 0.63**, indicating a fairly **consistent learning pattern** across all 10 folds.
  This means the model generalizes reasonably well on the training data.

- ⚠️ **Test Set Variability**:
  The test R² scores range from **0.42 to 0.61**, showing **slightly higher variance** across folds.
  Some folds show strong performance (e.g., Fold 2), while others drop noticeably (e.g., Fold 3).

- 🔁 **No Severe Overfitting Detected**:
  If the training R² were very high (e.g., 0.9) and the test R² low (e.g., 0.3), that would indicate **overfitting**.
  In this case, **training and test R² are fairly close**, suggesting the model is **not overfitting significantly**.

- 📉 **Room for Improvement**:
  An average test R² around **0.52** implies that the model explains **just over 50% of the variance** in house values.
  For business-critical applications like real estate pricing or policy decisions, we may consider:
  - **Feature engineering** (e.g., regional segmentation),
  - **Model tuning**, or
  - **Trying more expressive models** like decision trees or gradient boosting.
        """)

    # learning curve
    with st.expander("➡️ Was Training Data Sufficient? (Learning Curve Analysis)"):
        st.markdown("###### 📊 Learning Curve Analysis")

        # Generate learning curves
        train_sizes, train_scores, test_scores = learning_curve(
            model, X, y, cv=5, scoring='r2', train_sizes=np.linspace(0.1, 1.0, 10), shuffle=True, random_state=42
        )
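        # learning_curve refits the model on 10 increasing training-set sizes (10% to 100%
        # of the available data) under 5-fold CV, returning R² for training and validation folds.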

        # Calculate mean scores across CV folds
        train_scores_mean = np.mean(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)

        # Plotting
        fig, ax = plt.subplots(figsize=(9, 4))
        ax.plot(train_sizes, train_scores_mean, 'o-', color="blue", label="Training R²")
        ax.plot(train_sizes, test_scores_mean, 'o-', color="green", label="Validation R²")
        ax.set_title("Learning Curve: Linear Regression")
        ax.set_xlabel("Number of Training Samples")
        ax.set_ylabel("R² Score")
        ax.legend(loc="best")
        ax.grid(True)
        st.pyplot(fig)

        # Interpret results
        st.write("""
- ✅ **Training R² is high initially** (indicating the model learns patterns even with fewer samples).
- 📉 **Validation R² improves as training size increases**, then plateaus.
- 🧠 This suggests the model **benefits from more training data**, but after a certain point, **additional data does not significantly improve generalization**.
- 🔍 The **gap between training and validation curves** is relatively small, indicating **no severe overfitting**.
- 📌 **Conclusion**: The current dataset size seems **adequate**, and the model is learning well with the data provided.
        """)

    with st.expander("📊 Understand Feature Impact: Coefficients of the Linear Regression Model"):
        importance = model.coef_
        features = X.columns

        fig, ax = plt.subplots(figsize=(9, 5))
        ax.barh(features, importance, color='skyblue')
        ax.set_title("Feature Importance (Linear Regression Coefficients)")
        ax.set_xlabel("Coefficient Value")
        st.pyplot(fig)
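        # Caveat (illustrative note, not in the original app): raw coefficients carry the
        # units of their features, so absolute sizes are only directly comparable when the
        # features share a scale. A rough scale-adjusted view multiplies each coefficient
        # by its feature's standard deviation, e.g.:
        #   scaled_importance = model.coef_ * X_train.std().values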

        st.markdown("""
###### 🔍 Interpretation:
- Features with larger **absolute values** have a stronger effect on the predicted house value.
- A **positive coefficient** increases the predicted value.
- A **negative coefficient** decreases the predicted value.

###### 🧠 What it means for decision-makers:
- **Median Income** is a strong positive driver — wealthier areas tend to have higher housing values.
- **Latitude** has a negative coefficient — northern areas may have lower house prices.
- Helps focus strategic decisions on what really influences prices across California.
        """)

    with st.expander("🧠 Why Linear Regression Still Matters: Foundation for Deep Learning & Transformers"):
        st.markdown("""
Linear Regression may look simple, but it's far from trivial — it's the **first rung on the ladder** to advanced AI models like **Deep Learning** and **Transformers**.

###### 📚 Conceptual Foundations:
- **Weights & Bias**: The core of linear regression is learning weights and a bias — which is exactly what **every neural network layer** does, just at scale.
- **Loss Minimization**: Linear regression minimizes **Mean Squared Error** — the same principle used when training neural networks, whose weights are adjusted through **backpropagation**.
- **Linear Combinations**: Deep learning models, at their core, are just multiple layers of **linear transformations + non-linear activations**.

###### 🤖 Connection to Transformers:
- Transformer architectures (like GPT, BERT) use **linear projections** in their attention mechanisms.
- Every layer in these models performs matrix multiplications — which is, again, just advanced **linear algebra and regression-like operations**.

###### 🏗️ Strategic Insight:
- A solid grasp of linear regression builds the intuition needed to understand more complex systems.
- Senior leaders can better evaluate ML and AI project feasibility and interpret outcomes by understanding these **fundamentals that scale**.

🔄 *"From Linear Regression to Transformers, it's all about modeling relationships and optimizing parameters — just with different levels of complexity and abstraction."*
        """)
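        # Illustrative aside (not part of the original app): a single dense layer computes
        # the same linear form as regression, y = X @ W + b, followed by a non-linear
        # activation, e.g.
        #   hidden = np.maximum(0, X @ W + b)   # ReLU applied to a linear transformation
        # Stacking such layers (and the linear projections inside attention) is what deep
        # networks and Transformers are built from.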


if __name__ == "__main__":
    main()
requirements.txt
ADDED
@@ -0,0 +1,5 @@
streamlit
scikit-learn
pandas
seaborn
matplotlib
statsmodels  # required by app.py (variance_inflation_factor import)