Upload The Linear_Regression_Algorithm.py
pages/The Linear_Regression_Algorithm.py
ADDED
import streamlit as st
import pandas as pd

st.set_page_config(page_title="Regression Models Explorer", layout="wide")

st.title("📊 Regression Models - Linear Regression")

# --- Linear Regression Section ---
st.header("📈 Linear Regression - In Depth")

st.markdown(r"""
## 📘 What is Linear Regression?
Linear Regression is a **supervised learning** algorithm used to predict **continuous numeric outcomes** based on one or more input features. It assumes a **linear relationship** between the independent variable(s) (X) and the dependent variable (y).

Linear Regression can be:
- **Simple Linear Regression**: One feature (X)
- **Multiple Linear Regression**: Multiple features (X1, X2, ..., Xn)

---

## 📐 Mathematical Formulation
The standard form of the linear model is:

$$ y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = w^T x + b $$

Where:
- `x`: input feature vector (independent variables)
- `w`: weights (coefficients learned by the model)
- `b`: bias (intercept)
- `y`: predicted continuous output

The model parameters are estimated using **Ordinary Least Squares (OLS)** by minimizing the **Mean Squared Error (MSE)**:

$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
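
A minimal sketch of fitting such a model with scikit-learn's `LinearRegression` (the tiny dataset here is made up purely for illustration):

```python
# Minimal sketch: fit an OLS linear model and report its training MSE (toy data)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one feature, four samples
y = np.array([2.1, 4.0, 6.2, 7.9])           # continuous target

model = LinearRegression(fit_intercept=True)
model.fit(X, y)

print("w:", model.coef_, "b:", model.intercept_)         # learned weights and intercept
print("MSE:", mean_squared_error(y, model.predict(X)))   # mean squared error on the training data
```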

---

## 🔍 Key Concepts and Assumptions

### 1. **Least Squares Estimation**
The method used to estimate the parameters by minimizing the sum of squared differences between actual and predicted values.

### 2. **Residuals**
The difference between actual and predicted values, $e_i = y_i - \hat{y}_i$. Analyzing residuals helps identify model misspecification.

### 3. **R-squared (R²)**
Represents the proportion of variance in the target explained by the model:

$$ R^2 = 1 - \frac{\text{SSR}}{\text{SST}} $$

where SSR is the Sum of Squared Residuals and SST is the Total Sum of Squares.
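
In code, $R^2$ can be computed directly from that definition or with `sklearn.metrics.r2_score`; a small sketch, reusing the toy `model`, `X`, `y` from the earlier example:

```python
# Sketch: R² from its definition vs. scikit-learn's helper (reuses model, X, y from above)
import numpy as np
from sklearn.metrics import r2_score

y_pred = model.predict(X)
ssr = np.sum((y - y_pred) ** 2)      # sum of squared residuals
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
print(1 - ssr / sst)                 # R² from the formula
print(r2_score(y, y_pred))           # same value via scikit-learn
```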

### 4. **Key Assumptions**
- **Linearity**: The relationship between X and y is linear.
- **Independence**: Observations are independent of each other.
- **Homoscedasticity**: Residuals have constant variance.
- **Normality**: Residuals are normally distributed.
- **No Multicollinearity**: Features should not be highly correlated with each other.

Violating these assumptions can make the coefficient estimates and predictions unreliable; a quick residual check is sketched below.
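
A quick way to eyeball linearity and homoscedasticity is to plot residuals against fitted values; a rough sketch, assuming a fitted `model` and data `X`, `y` as in the earlier example:

```python
# Rough sketch: residuals vs. fitted values (assumes model, X, y from the earlier example)
import matplotlib.pyplot as plt

fitted = model.predict(X)
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, color="red", linestyle="--")  # residuals should scatter evenly around zero
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```

Curvature in this plot suggests the linearity assumption is violated; a funnel shape suggests heteroscedasticity.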

---

## 🔧 Common Hyperparameters (Scikit-learn)

| Parameter        | Description                                                                         |
|------------------|-------------------------------------------------------------------------------------|
| `fit_intercept`  | If True, the model fits an intercept. Set False when the data is already centered.   |
| `copy_X`         | If True, X is copied before fitting. Set False for memory efficiency.                |
| `n_jobs`         | Number of CPU cores used for computation. Useful for large datasets.                 |

**Note**: `normalize` has been deprecated. Use preprocessing (e.g., `StandardScaler`) in a pipeline instead.
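
For example, a sketch of standardizing features before fitting by wrapping both steps in a pipeline (`X_train`, `y_train`, `X_test` are placeholders for your own data):

```python
# Sketch: scale features, then fit, inside a single estimator
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipe = make_pipeline(StandardScaler(), LinearRegression(fit_intercept=True))
# pipe.fit(X_train, y_train)    # X_train, y_train: your training data
# pipe.predict(X_test)          # X_test: new samples
```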

---

## ✅ Advantages
- Simple and fast to implement.
- Works well when its assumptions are met.
- Coefficients are directly interpretable.
- Computationally inexpensive to train.

## ❌ Disadvantages
- Limited to linear relationships unless features are transformed.
- Sensitive to outliers.
- Performs poorly with irrelevant or highly correlated features.

---

## 🧪 Optuna for Hyperparameter Tuning (Conceptual)
Plain Linear Regression has few hyperparameters, but Optuna becomes useful in more complex pipelines:

- **Polynomial Regression**: Tune the degree of the polynomial.
- **Ridge / Lasso / ElasticNet**: Tune regularization strength (`alpha`, `l1_ratio`).
- **Feature Selection**: Use Optuna to select the best subset of features.

You define an **objective function** (e.g., minimize RMSE), and Optuna searches the hyperparameter space with **Bayesian optimization** (its default TPE sampler).
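
As a minimal sketch, an objective that tunes the `alpha` of a Ridge model with cross-validated RMSE might look like this (`X`, `y` are assumed to be your feature matrix and target):

```python
# Sketch: tune Ridge's regularization strength with Optuna (X, y assumed to exist)
import optuna
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Sample alpha on a log scale
    alpha = trial.suggest_float("alpha", 1e-4, 1e2, log=True)
    model = Ridge(alpha=alpha)
    scores = cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=5)
    return -scores.mean()  # cross-validated RMSE, which the study minimizes

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```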

---

## 📌 Use Cases
- Predicting house prices from features like area, number of rooms, and location
- Sales forecasting from historical data
- Medical cost estimation from patient information
- Predicting CO₂ emissions from engine parameters

📎 **Tip**: Always visualize **residual plots** to verify the assumptions, and consider adding interaction or polynomial terms to capture nonlinear structure (see the sketch below).
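
A sketch of adding polynomial and interaction terms with `PolynomialFeatures` (the degree here is arbitrary; tune it for your data):

```python
# Sketch: capture nonlinearity by expanding features before the linear fit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           LinearRegression())
# poly_model.fit(X_train, y_train)   # X_train, y_train: your training data
```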
""")