{ "cells": [ { "cell_type": "markdown", "id": "c73f8bf0-b957-4da5-88ab-4b030586cde5", "metadata": {}, "source": [ "# DIMENSIONALITY REDUCTION\n", "\n", "--------------------------------------------\n", "PHASE 1: EXPLAIN & BREAKDOWN (LEARNING PHASE)\n", "--------------------------------------------\n", "\n", "## 1. Simple Explanation\n", "\n", "Dimensionality reduction is like taking a 3D object and casting a 2D shadow that preserves the most important information. Imagine a dataset with 1000 features (columns) describing each data point, many of them redundant or noisy. Dimensionality reduction techniques compress this data into fewer dimensions (perhaps 10-50) while keeping the essential patterns intact.\n", "\n", "Think of it like condensing a 500-page book into a 20-page summary: you lose some details, but the main ideas remain. This matters in AI because high-dimensional data is hard to visualize, slow to process, and prone to the \"curse of dimensionality\" (as dimensions grow, data becomes sparse and distances between points become nearly indistinguishable, so distance-based algorithms degrade). Common techniques include PCA (Principal Component Analysis), t-SNE, and autoencoders. Typical uses include image compression, data visualization, noise reduction, and preparing data for machine learning models.\n", "\n", "## 2. 
Detailed Roadmap with Concrete Examples\n", "\n", "**Step 1: Understanding the Problem**\n", "- **Curse of Dimensionality**: Example - Finding nearest neighbors in 2D vs 1000D space\n", "- **Computational Complexity**: Example - Processing 28×28 pixel images (784 features) vs 10 compressed features\n", "- **Visualization Challenges**: Example - Plotting customer data with 50 attributes\n", "\n", "**Step 2: Linear Dimensionality Reduction**\n", "- **Principal Component Analysis (PCA)**: Example - Reducing face images from 10,000 pixels to 100 principal components\n", "- **Linear Discriminant Analysis (LDA)**: Example - Separating iris flower species using 2 components instead of 4 features\n", "- **Factor Analysis**: Example - Finding underlying factors in psychological test scores\n", "\n", "**Step 3: Non-Linear Dimensionality Reduction**\n", "- **t-SNE**: Example - Visualizing high-dimensional word embeddings in 2D scatter plots\n", "- **UMAP**: Example - Exploring single-cell RNA sequencing data clusters\n", "- **Isomap**: Example - Unfolding Swiss roll dataset to reveal underlying 2D structure\n", "\n", "**Step 4: Neural Network Approaches**\n", "- **Autoencoders**: Example - Compressing MNIST digit images from 784 to 32 dimensions\n", "- **Variational Autoencoders (VAE)**: Example - Generating new faces by sampling from learned latent space\n", "- **Deep Feature Learning**: Example - Using CNN layers as feature extractors\n", "\n", "**Step 5: Evaluation and Selection**\n", "- **Explained Variance**: Example - Choosing number of PCA components to retain 95% variance\n", "- **Reconstruction Error**: Example - Measuring how well compressed images match originals\n", "- **Downstream Task Performance**: Example - Classification accuracy after dimensionality reduction\n", "\n", "## 3. 
Formula Memory Aids Section\n", "\n", "### PCA Covariance Matrix Formula\n", "**FORMULA**: C = (1/n) × X^T × X (X mean-centered)\n", "\n", "**REAL-LIFE ANALOGY**: \"How do your friends' personality traits vary together?\"\n", "- C = Trait-by-trait matrix showing which traits rise and fall together\n", "- X = Each friend's personality traits (rows=friends, columns=traits), with each trait's average subtracted\n", "- X^T = Flipping the friend-trait table\n", "- 1/n = Averaging across all your friends (the sample covariance, which scikit-learn uses, divides by n-1 instead; this rescales the eigenvalues but not the directions)\n", "\n", "**MEMORY TRICK**: \"Co-variance = features varying together - how features dance together!\"\n", "\n", "### PCA Eigenvalue Decomposition Formula\n", "**FORMULA**: C × v = λ × v\n", "\n", "**REAL-LIFE ANALOGY**: \"Which direction does your friend group naturally lean?\"\n", "- C = Covariance matrix of the group's traits\n", "- v = Direction of strongest group tendency (eigenvector)\n", "- λ = How strong that tendency is (eigenvalue, the variance along v)\n", "- The equation means: \"Group tendency × Direction = Strength × Same Direction\"\n", "\n", "**MEMORY TRICK**: \"Eigen = 'Own' in German - finding data's 'own' natural directions!\"\n", "\n", "### Explained Variance Ratio Formula\n", "**FORMULA**: Explained Variance Ratio = λᵢ / Σⱼ λⱼ\n", "\n", "**REAL-LIFE ANALOGY**: \"What percentage of your friend group's energy goes into sports vs studies?\"\n", "- λᵢ = Energy spent on sports (one eigenvalue)\n", "- Σⱼ λⱼ = Total energy of the group (sum of all eigenvalues)\n", "- Ratio = Sports energy / Total energy\n", "\n", "**MEMORY TRICK**: \"Explained = Ex-plained on a plane - how much info fits on each dimension!\"\n", "\n", "### t-SNE Similarity Formula\n", "**FORMULA**: p(j|i) = exp(-||xᵢ - xⱼ||² / (2σᵢ²)) / Σₖ≠ᵢ exp(-||xᵢ - xₖ||² / (2σᵢ²))\n", "\n", "**REAL-LIFE ANALOGY**: \"How similar are two people in a crowded room?\"\n", "- p(j|i) = Probability that person i picks person j as a neighbor (t-SNE then symmetrizes: pᵢⱼ = (p(j|i) + p(i|j)) / (2n))\n", "- ||xᵢ - xⱼ||² = How different their personalities are (squared distance)\n", "- σᵢ² = How picky person i is about friendships (bandwidth, chosen per point to match the perplexity)\n", "- exp(-distance/pickiness) = Friendship probability decreases with 
distance/pickiness\n", "\n", "**MEMORY TRICK**: \"t-SNE = t-See Neighbors Everywhere - finding similar points!\"\n", "\n", "## 4. Step-by-Step Numerical Example (PCA on 2D data)\n", "\n", "**Dataset**: 4 points in 2D space\n", "```\n", "Point 1: (1, 2)\n", "Point 2: (3, 4) \n", "Point 3: (5, 6)\n", "Point 4: (7, 8)\n", "```\n", "\n", "**Step 1: Center the data (subtract mean)**\n", "```\n", "Mean = (4, 5)\n", "Centered data:\n", "Point 1: (-3, -3)\n", "Point 2: (-1, -1)\n", "Point 3: (1, 1)\n", "Point 4: (3, 3)\n", "```\n", "\n", "**Step 2: Calculate covariance matrix**\n", "```\n", "X = [[-3, -3],\n", " [-1, -1],\n", " [1, 1],\n", " [3, 3]]\n", "\n", "C = (1/4) × X^T × X\n", " = (1/4) × [[20, 20],\n", " [20, 20]]\n", " = [[5, 5],\n", " [5, 5]]\n", "```\n", "\n", "**Step 3: Find eigenvalues and eigenvectors**\n", "```\n", "Characteristic equation: det(C - λI) = 0\n", "(5-λ)² - 25 = 0\n", "λ² - 10λ = 0\n", "λ₁ = 10, λ₂ = 0\n", "\n", "Eigenvector for λ₁ = 10: v₁ = [1/√2, 1/√2]\n", "Eigenvector for λ₂ = 0: v₂ = [1/√2, -1/√2]\n", "```\n", "\n", "**Step 4: Project data onto first principal component**\n", "```\n", "PC1 = X × v₁ = [[-3, -3], [-1, -1], [1, 1], [3, 3]] × [1/√2, 1/√2]\n", " = [-6/√2, -2/√2, 2/√2, 6/√2]\n", " = [-4.24, -1.41, 1.41, 4.24]\n", "```\n", "\n", "**Result**: 2D data reduced to 1D with 100% explained variance!\n", "\n", "## 5. Real-World AI Use Case\n", "\n", "**Netflix Recommendation System**:\n", "Netflix has millions of users and thousands of movies, creating a massive user-movie rating matrix. Using matrix factorization (a form of dimensionality reduction), they:\n", "\n", "1. **Compress user preferences**: Reduce each user's 10,000+ movie ratings to ~50 latent factors (like \"action lover\", \"comedy fan\", \"indie preference\")\n", "2. **Compress movie features**: Reduce each movie's characteristics to the same 50 factors\n", "3. **Make predictions**: Multiply user factors × movie factors to predict ratings\n", "4. 
**Handle sparsity**: Most users haven't rated most movies, but the compressed representation can still make predictions\n", "\n", "This reduces storage, speeds up computation, and reveals hidden patterns like \"users who like sci-fi also tend to like thrillers.\"\n", "\n", "## 6. Tips for Mastering This Topic\n", "\n", "**Practice Sources**:\n", "- Scikit-learn documentation and examples\n", "- Kaggle datasets (Iris, Wine, Breast Cancer for beginners)\n", "- Andrew Ng's CS229 Stanford lectures on PCA\n", "- Sebastian Raschka's \"Python Machine Learning\" book\n", "\n", "**Hands-on Projects**:\n", "1. **Visualize high-dimensional data**: Use t-SNE on MNIST digits\n", "2. **Image compression**: Apply PCA to face images\n", "3. **Feature extraction**: Compare PCA components vs original features for classification\n", "4. **Clustering**: Use dimensionality reduction before K-means\n", "\n", "**Key Resources**:\n", "- **Theory**: \"Elements of Statistical Learning\" (Hastie, Tibshirani, Friedman)\n", "- **Implementation**: Scikit-learn user guide on decomposition\n", "- **Visualization**: Matplotlib and Plotly for 2D/3D scatter plots\n", "- **Practice**: Coursera ML course assignments\n", "\n", "**Common Pitfalls to Avoid**:\n", "- Don't apply PCA directly to categorical variables\n", "- Standardize features before PCA when they are on different scales\n", "- Remember: PCA assumes centered data; scikit-learn's `PCA` centers it automatically, but a manual implementation must subtract the mean first\n", "- Choose the number of components from explained variance, not an arbitrary number"
] }, { "cell_type": "code", "execution_count": 2, "id": "1af263fd-a090-4126-9af3-cb6afef9efff", "metadata": {}, "outputs": [], "source": [ "!pip install numpy pandas scikit-learn matplotlib seaborn plotly umap-learn\n", "!pip install torch torchvision # For autoencoder implementation" ] }, { "cell_type": "code", 
"execution_count": 3, "id": "ba8acc30-65df-4ca8-9852-5a09177e4195", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-07-16 10:36:41,644 - INFO - Starting Dimensionality Reduction Suite\n", "2025-07-16 10:36:41,645 - INFO - Loading datasets for dimensionality reduction analysis\n", "2025-07-16 10:36:41,647 - INFO - Iris dataset loaded: (150, 4) features, 3 classes\n", "2025-07-16 10:36:41,656 - INFO - Digits dataset loaded: (1797, 64) features, 10 classes\n", "2025-07-16 10:36:41,658 - INFO - Data standardization completed\n", "2025-07-16 10:36:41,658 - INFO - === APPLYING PCA ===\n", "2025-07-16 10:36:41,658 - INFO - Applying PCA to iris dataset\n", "2025-07-16 10:36:41,661 - INFO - PCA completed for iris\n", "2025-07-16 10:36:41,661 - INFO - Explained variance per component: [0.72962445 0.22850762]\n", "2025-07-16 10:36:41,661 - INFO - Cumulative explained variance: [0.72962445 0.95813207]\n", "2025-07-16 10:36:41,662 - INFO - Applying PCA to digits dataset\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/decomposition/_pca.py:604: RuntimeWarning: divide by zero encountered in matmul\n", " C = X.T @ X\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/decomposition/_pca.py:604: RuntimeWarning: overflow encountered in matmul\n", " C = X.T @ X\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/decomposition/_pca.py:604: RuntimeWarning: invalid value encountered in matmul\n", " C = X.T @ X\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/decomposition/_base.py:148: RuntimeWarning: divide by zero encountered in matmul\n", " X_transformed = X @ self.components_.T\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/decomposition/_base.py:148: RuntimeWarning: overflow encountered in matmul\n", " X_transformed = X @ self.components_.T\n", 
"/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/decomposition/_base.py:148: RuntimeWarning: invalid value encountered in matmul\n", " X_transformed = X @ self.components_.T\n", "2025-07-16 10:36:41,670 - INFO - PCA completed for digits\n", "2025-07-16 10:36:41,671 - INFO - Explained variance per component: [0.12033916 0.09561054]\n", "2025-07-16 10:36:41,672 - INFO - Cumulative explained variance: [0.12033916 0.21594971]\n", "2025-07-16 10:36:41,672 - INFO - === APPLYING t-SNE ===\n", "2025-07-16 10:36:41,673 - INFO - Applying t-SNE to iris dataset with perplexity=30\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:350: RuntimeWarning: divide by zero encountered in matmul\n", " Q, _ = normalizer(A @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:350: RuntimeWarning: overflow encountered in matmul\n", " Q, _ = normalizer(A @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:350: RuntimeWarning: invalid value encountered in matmul\n", " Q, _ = normalizer(A @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:351: RuntimeWarning: divide by zero encountered in matmul\n", " Q, _ = normalizer(A.T @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:351: RuntimeWarning: overflow encountered in matmul\n", " Q, _ = normalizer(A.T @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:351: RuntimeWarning: invalid value encountered in matmul\n", " Q, _ = normalizer(A.T @ Q)\n", "2025-07-16 10:36:42,159 - INFO - t-SNE completed for iris\n", "2025-07-16 10:36:42,159 - INFO - Final KL divergence: 0.14698290824890137\n", "2025-07-16 10:36:42,159 - INFO - Applying t-SNE to digits dataset with perplexity=30\n", 
"/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:350: RuntimeWarning: divide by zero encountered in matmul\n", " Q, _ = normalizer(A @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:350: RuntimeWarning: overflow encountered in matmul\n", " Q, _ = normalizer(A @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:350: RuntimeWarning: invalid value encountered in matmul\n", " Q, _ = normalizer(A @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:351: RuntimeWarning: divide by zero encountered in matmul\n", " Q, _ = normalizer(A.T @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:351: RuntimeWarning: overflow encountered in matmul\n", " Q, _ = normalizer(A.T @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:351: RuntimeWarning: invalid value encountered in matmul\n", " Q, _ = normalizer(A.T @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:355: RuntimeWarning: divide by zero encountered in matmul\n", " Q, _ = qr_normalizer(A @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:355: RuntimeWarning: overflow encountered in matmul\n", " Q, _ = qr_normalizer(A @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:355: RuntimeWarning: invalid value encountered in matmul\n", " Q, _ = qr_normalizer(A @ Q)\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:577: RuntimeWarning: divide by zero encountered in matmul\n", " B = Q.T @ M\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:577: RuntimeWarning: overflow encountered in 
matmul\n", " B = Q.T @ M\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:577: RuntimeWarning: invalid value encountered in matmul\n", " B = Q.T @ M\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:590: RuntimeWarning: divide by zero encountered in matmul\n", " U = Q @ Uhat\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:590: RuntimeWarning: overflow encountered in matmul\n", " U = Q @ Uhat\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/sklearn/utils/extmath.py:590: RuntimeWarning: invalid value encountered in matmul\n", " U = Q @ Uhat\n", "2025-07-16 10:36:43,689 - INFO - t-SNE completed for digits\n", "2025-07-16 10:36:43,690 - INFO - Final KL divergence: 0.8376309275627136\n", "2025-07-16 10:36:43,690 - INFO - === APPLYING UMAP ===\n", "2025-07-16 10:36:43,691 - INFO - Applying UMAP to iris dataset with n_neighbors=15\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.\n", " warn(\n", "2025-07-16 10:36:46,402 - INFO - UMAP completed for iris\n", "2025-07-16 10:36:46,403 - INFO - Applying UMAP to digits dataset with n_neighbors=15\n", "/Users/karthik/Desktop/importants/venv/lib/python3.13/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. 
Use no seed for parallelism.\n", " warn(\n", "2025-07-16 10:36:48,356 - INFO - UMAP completed for digits\n", "2025-07-16 10:36:48,356 - INFO - === APPLYING AUTOENCODER ===\n", "2025-07-16 10:36:48,356 - INFO - Training autoencoder for iris dataset\n", "2025-07-16 10:36:48,357 - INFO - Input dimension: 4, Encoding dimension: 2\n", "2025-07-16 10:36:49,110 - INFO - Epoch 20/50, Loss: 0.314444\n", "2025-07-16 10:36:49,122 - INFO - Epoch 40/50, Loss: 0.169524\n", "2025-07-16 10:36:49,140 - INFO - Autoencoder training completed for iris\n", "2025-07-16 10:36:49,140 - INFO - Final reconstruction loss: 0.081181\n", "2025-07-16 10:36:49,143 - INFO - Training autoencoder for digits dataset\n", "2025-07-16 10:36:49,144 - INFO - Input dimension: 64, Encoding dimension: 10\n", "2025-07-16 10:36:49,215 - INFO - Epoch 20/100, Loss: 0.856640\n", "2025-07-16 10:36:49,266 - INFO - Epoch 40/100, Loss: 0.649845\n", "2025-07-16 10:36:49,316 - INFO - Epoch 60/100, Loss: 0.515600\n", "2025-07-16 10:36:49,363 - INFO - Epoch 80/100, Loss: 0.427001\n", "2025-07-16 10:36:49,408 - INFO - Epoch 100/100, Loss: 0.348234\n", "2025-07-16 10:36:49,409 - INFO - Autoencoder training completed for digits\n", "2025-07-16 10:36:49,410 - INFO - Final reconstruction loss: 0.348234\n", "2025-07-16 10:36:49,410 - INFO - === EVALUATING METHODS ===\n", "2025-07-16 10:36:49,410 - INFO - Evaluating PCA performance on iris dataset\n", "2025-07-16 10:36:49,511 - INFO - Original data accuracy: 0.8889\n", "2025-07-16 10:36:49,511 - INFO - Reduced data accuracy: 0.8667\n", "2025-07-16 10:36:49,512 - INFO - Accuracy retention: 97.50%\n", "2025-07-16 10:36:49,512 - INFO - Evaluating TSNE performance on iris dataset\n", "2025-07-16 10:36:49,604 - INFO - Original data accuracy: 0.8889\n", "2025-07-16 10:36:49,604 - INFO - Reduced data accuracy: 0.9333\n", "2025-07-16 10:36:49,604 - INFO - Accuracy retention: 105.00%\n", "2025-07-16 10:36:49,605 - INFO - Evaluating UMAP performance on iris dataset\n", "2025-07-16 
10:36:49,697 - INFO - Original data accuracy: 0.8889\n", "2025-07-16 10:36:49,698 - INFO - Reduced data accuracy: 0.9111\n", "2025-07-16 10:36:49,698 - INFO - Accuracy retention: 102.50%\n", "2025-07-16 10:36:49,698 - INFO - Evaluating PCA performance on digits dataset\n", "2025-07-16 10:36:49,920 - INFO - Original data accuracy: 0.9685\n", "2025-07-16 10:36:49,920 - INFO - Reduced data accuracy: 0.5074\n", "2025-07-16 10:36:49,920 - INFO - Accuracy retention: 52.39%\n", "2025-07-16 10:36:49,921 - INFO - Evaluating TSNE performance on digits dataset\n", "2025-07-16 10:36:50,117 - INFO - Original data accuracy: 0.9685\n", "2025-07-16 10:36:50,117 - INFO - Reduced data accuracy: 0.9722\n", "2025-07-16 10:36:50,117 - INFO - Accuracy retention: 100.38%\n", "2025-07-16 10:36:50,118 - INFO - Evaluating UMAP performance on digits dataset\n", "2025-07-16 10:36:50,325 - INFO - Original data accuracy: 0.9685\n", "2025-07-16 10:36:50,326 - INFO - Reduced data accuracy: 0.9611\n", "2025-07-16 10:36:50,326 - INFO - Accuracy retention: 99.24%\n", "2025-07-16 10:36:50,326 - INFO - Creating comprehensive visualizations\n", "2025-07-16 10:36:51,350 - INFO - All visualizations saved to visualizations/ directory\n", "2025-07-16 10:36:51,351 - INFO - Saving trained models\n", "2025-07-16 10:36:51,360 - INFO - Saving results summary\n", "2025-07-16 10:36:51,361 - INFO - === FINAL SUMMARY ===\n", "2025-07-16 10:36:51,362 - INFO - Iris Dataset - PCA Explained Variance: [0.72962445 0.22850762]\n", "2025-07-16 10:36:51,362 - INFO - Digits Dataset - PCA Explained Variance: [0.12033916 0.09561054]\n", "2025-07-16 10:36:51,363 - INFO - \n", "IRIS Dataset Classification Performance:\n", "2025-07-16 10:36:51,363 - INFO - PCA: 97.50% accuracy retention\n", "2025-07-16 10:36:51,363 - INFO - TSNE: 105.00% accuracy retention\n", "2025-07-16 10:36:51,363 - INFO - UMAP: 102.50% accuracy retention\n", "2025-07-16 10:36:51,363 - INFO - \n", "DIGITS Dataset Classification Performance:\n", "2025-07-16 
10:36:51,364 - INFO - PCA: 52.39% accuracy retention\n",
"2025-07-16 10:36:51,364 - INFO - TSNE: 100.38% accuracy retention\n",
"2025-07-16 10:36:51,364 - INFO - UMAP: 99.24% accuracy retention\n",
"2025-07-16 10:36:51,365 - INFO - \n",
"All models saved to models/ directory\n",
"2025-07-16 10:36:51,365 - INFO - All results saved to results/ directory\n",
"2025-07-16 10:36:51,365 - INFO - All visualizations saved to visualizations/ directory\n",
"2025-07-16 10:36:51,366 - INFO - Dimensionality Reduction Suite completed successfully!\n" ] } ], "source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import plotly.express as px\n",
"import plotly.graph_objects as go\n",
"from sklearn.datasets import load_iris, load_digits\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.manifold import TSNE\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import accuracy_score, classification_report\n",
"import umap\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.optim as optim\n",
"import pickle\n",
"import json\n",
"import logging\n",
"import os\n",
"from datetime import datetime\n",
"\n",
"# Configure logging\n",
"logging.basicConfig(\n",
"    level=logging.INFO,\n",
"    format='%(asctime)s - %(levelname)s - %(message)s',\n",
"    handlers=[\n",
"        logging.FileHandler('dimensionality_reduction.log'),\n",
"        logging.StreamHandler()\n",
"    ]\n",
")\n",
"\n",
"# Create results directory\n",
"os.makedirs('results', exist_ok=True)\n",
"os.makedirs('models', exist_ok=True)\n",
"os.makedirs('visualizations', exist_ok=True)\n",
"\n",
"class DimensionalityReductionSuite:\n",
"    def __init__(self):\n",
"        self.results = {}\n",
"        self.models = {}\n",
"\n",
"    def load_and_prepare_data(self):\n",
"        logging.info(\"Loading datasets for dimensionality reduction analysis\")\n",
"\n",
"        # Load Iris dataset (low-dimensional example)\n",
"        iris = load_iris()\n",
"        self.iris_data = iris.data\n",
"        self.iris_target = iris.target\n",
"        self.iris_target_names = iris.target_names\n",
"        self.iris_feature_names = iris.feature_names\n",
"\n",
"        logging.info(f\"Iris dataset loaded: {self.iris_data.shape} features, {len(np.unique(self.iris_target))} classes\")\n",
"\n",
"        # Load Digits dataset (high-dimensional example)\n",
"        digits = load_digits()\n",
"        self.digits_data = digits.data\n",
"        self.digits_target = digits.target\n",
"        self.digits_images = digits.images\n",
"\n",
"        logging.info(f\"Digits dataset loaded: {self.digits_data.shape} features, {len(np.unique(self.digits_target))} classes\")\n",
"\n",
"        # Standardize the data\n",
"        self.scaler_iris = StandardScaler()\n",
"        self.iris_scaled = self.scaler_iris.fit_transform(self.iris_data)\n",
"\n",
"        self.scaler_digits = StandardScaler()\n",
"        self.digits_scaled = self.scaler_digits.fit_transform(self.digits_data)\n",
"\n",
"        logging.info(\"Data standardization completed\")\n",
"\n",
"    def apply_pca(self, data, dataset_name, n_components=2):\n",
"        logging.info(f\"Applying PCA to {dataset_name} dataset\")\n",
"\n",
"        pca = PCA(n_components=n_components)\n",
"        data_pca = pca.fit_transform(data)\n",
"\n",
"        # Calculate explained variance\n",
"        explained_variance = pca.explained_variance_ratio_\n",
"        cumulative_variance = np.cumsum(explained_variance)\n",
"\n",
"        logging.info(f\"PCA completed for {dataset_name}\")\n",
"        logging.info(f\"Explained variance per component: {explained_variance}\")\n",
"        logging.info(f\"Cumulative explained variance: {cumulative_variance}\")\n",
"\n",
"        # Store results\n",
"        self.results[f'{dataset_name}_pca'] = {\n",
"            'transformed_data': data_pca,\n",
"            'explained_variance': explained_variance,\n",
"            'cumulative_variance': cumulative_variance,\n",
"            'components': pca.components_\n",
"        }\n",
"\n",
"        self.models[f'{dataset_name}_pca'] = pca\n",
"\n",
"        return data_pca, explained_variance\n",
"\n",
"    def apply_tsne(self, data, dataset_name, n_components=2, perplexity=30):\n",
"        logging.info(f\"Applying t-SNE to {dataset_name} dataset with perplexity={perplexity}\")\n",
"\n",
"        tsne = TSNE(n_components=n_components, perplexity=perplexity, random_state=42)\n",
"        data_tsne = tsne.fit_transform(data)\n",
"\n",
"        logging.info(f\"t-SNE completed for {dataset_name}\")\n",
"        logging.info(f\"Final KL divergence: {tsne.kl_divergence_}\")\n",
"\n",
"        # Store results\n",
"        self.results[f'{dataset_name}_tsne'] = {\n",
"            'transformed_data': data_tsne,\n",
"            'kl_divergence': tsne.kl_divergence_\n",
"        }\n",
"\n",
"        return data_tsne\n",
"\n",
"    def apply_umap(self, data, dataset_name, n_components=2, n_neighbors=15):\n",
"        logging.info(f\"Applying UMAP to {dataset_name} dataset with n_neighbors={n_neighbors}\")\n",
"\n",
"        umap_reducer = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors, random_state=42)\n",
"        data_umap = umap_reducer.fit_transform(data)\n",
"\n",
"        logging.info(f\"UMAP completed for {dataset_name}\")\n",
"\n",
"        # Store results\n",
"        self.results[f'{dataset_name}_umap'] = {\n",
"            'transformed_data': data_umap\n",
"        }\n",
"\n",
"        self.models[f'{dataset_name}_umap'] = umap_reducer\n",
"\n",
"        return data_umap\n",
"\n",
"class SimpleAutoencoder(nn.Module):\n",
"    def __init__(self, input_dim, encoding_dim):\n",
"        super(SimpleAutoencoder, self).__init__()\n",
"        self.encoder = nn.Sequential(\n",
"            nn.Linear(input_dim, 128),\n",
"            nn.ReLU(),\n",
"            nn.Linear(128, 64),\n",
"            nn.ReLU(),\n",
"            nn.Linear(64, encoding_dim)\n",
"        )\n",
"\n",
"        self.decoder = nn.Sequential(\n",
"            nn.Linear(encoding_dim, 64),\n",
"            nn.ReLU(),\n",
"            nn.Linear(64, 128),\n",
"            nn.ReLU(),\n",
"            nn.Linear(128, input_dim)\n",
"        )\n",
"\n",
"    def forward(self, x):\n",
"        encoded = self.encoder(x)\n",
"        decoded = self.decoder(encoded)\n",
"        return decoded, encoded\n",
"\n",
"def train_autoencoder(data, dataset_name, encoding_dim=10, epochs=100, lr=0.001):\n",
"    logging.info(f\"Training autoencoder for {dataset_name} dataset\")\n",
"    logging.info(f\"Input dimension: {data.shape[1]}, Encoding dimension: {encoding_dim}\")\n",
"\n",
"    # Convert to PyTorch tensors\n",
"    data_tensor = torch.FloatTensor(data)\n",
"\n",
"    # Initialize model\n",
"    model = SimpleAutoencoder(data.shape[1], encoding_dim)\n",
"    criterion = nn.MSELoss()\n",
"    optimizer = optim.Adam(model.parameters(), lr=lr)\n",
"\n",
"    # Training loop\n",
"    losses = []\n",
"    for epoch in range(epochs):\n",
"        optimizer.zero_grad()\n",
"        reconstructed, encoded = model(data_tensor)\n",
"        loss = criterion(reconstructed, data_tensor)\n",
"        loss.backward()\n",
"        optimizer.step()\n",
"\n",
"        losses.append(loss.item())\n",
"\n",
"        if (epoch + 1) % 20 == 0:\n",
"            logging.info(f\"Epoch {epoch+1}/{epochs}, Loss: {loss.item():.6f}\")\n",
"\n",
"    # Get final encodings\n",
"    with torch.no_grad():\n",
"        _, final_encoded = model(data_tensor)\n",
"        final_encoded = final_encoded.numpy()\n",
"\n",
"    logging.info(f\"Autoencoder training completed for {dataset_name}\")\n",
"    logging.info(f\"Final reconstruction loss: {losses[-1]:.6f}\")\n",
"\n",
"    return final_encoded, model, losses\n",
"\n",
"def evaluate_dimensionality_reduction(original_data, reduced_data, target, dataset_name, method_name):\n",
"    logging.info(f\"Evaluating {method_name} performance on {dataset_name} dataset\")\n",
"\n",
"    # Split data for classification test\n",
"    X_train_orig, X_test_orig, y_train, y_test = train_test_split(\n",
"        original_data, target, test_size=0.3, random_state=42, stratify=target\n",
"    )\n",
"\n",
"    X_train_red, X_test_red, _, _ = train_test_split(\n",
"        reduced_data, target, test_size=0.3, random_state=42, stratify=target\n",
"    )\n",
"\n",
"    # Train classifiers\n",
"    rf_orig = RandomForestClassifier(random_state=42)\n",
"    rf_red = RandomForestClassifier(random_state=42)\n",
"\n",
"    rf_orig.fit(X_train_orig, y_train)\n",
rf_red.fit(X_train_red, y_train)\n", " \n", " # Evaluate\n", " acc_orig = accuracy_score(y_test, rf_orig.predict(X_test_orig))\n", " acc_red = accuracy_score(y_test, rf_red.predict(X_test_red))\n", " \n", " logging.info(f\"Original data accuracy: {acc_orig:.4f}\")\n", " logging.info(f\"Reduced data accuracy: {acc_red:.4f}\")\n", " logging.info(f\"Accuracy retention: {(acc_red/acc_orig)*100:.2f}%\")\n", " \n", " return {\n", " 'original_accuracy': acc_orig,\n", " 'reduced_accuracy': acc_red,\n", " 'accuracy_retention': (acc_red/acc_orig)*100\n", " }\n", "\n", "def create_visualizations(dr_suite):\n", " logging.info(\"Creating comprehensive visualizations\")\n", " \n", " # 1. PCA Explained Variance Plot\n", " plt.figure(figsize=(12, 5))\n", " \n", " plt.subplot(1, 2, 1)\n", " iris_pca_var = dr_suite.results['iris_pca']['explained_variance']\n", " plt.bar(range(1, len(iris_pca_var)+1), iris_pca_var)\n", " plt.title('Iris Dataset - PCA Explained Variance')\n", " plt.xlabel('Principal Component')\n", " plt.ylabel('Explained Variance Ratio')\n", " \n", " plt.subplot(1, 2, 2)\n", " digits_pca_var = dr_suite.results['digits_pca']['explained_variance']\n", " plt.bar(range(1, len(digits_pca_var)+1), digits_pca_var)\n", " plt.title('Digits Dataset - PCA Explained Variance')\n", " plt.xlabel('Principal Component')\n", " plt.ylabel('Explained Variance Ratio')\n", " \n", " plt.tight_layout()\n", " plt.savefig('visualizations/pca_explained_variance.png', dpi=300, bbox_inches='tight')\n", " plt.close()\n", " \n", " # 2. 
Comparison of methods on Iris dataset\n", " fig, axes = plt.subplots(2, 2, figsize=(15, 12))\n", " \n", " # Original data (first 2 features)\n", " axes[0, 0].scatter(dr_suite.iris_data[:, 0], dr_suite.iris_data[:, 1], \n", " c=dr_suite.iris_target, cmap='viridis', alpha=0.7)\n", " axes[0, 0].set_title('Original Data (First 2 Features)')\n", " axes[0, 0].set_xlabel('Sepal Length')\n", " axes[0, 0].set_ylabel('Sepal Width')\n", " \n", " # PCA\n", " pca_data = dr_suite.results['iris_pca']['transformed_data']\n", " axes[0, 1].scatter(pca_data[:, 0], pca_data[:, 1], \n", " c=dr_suite.iris_target, cmap='viridis', alpha=0.7)\n", " axes[0, 1].set_title('PCA Reduction')\n", " axes[0, 1].set_xlabel('PC1')\n", " axes[0, 1].set_ylabel('PC2')\n", " \n", " # t-SNE\n", " tsne_data = dr_suite.results['iris_tsne']['transformed_data']\n", " axes[1, 0].scatter(tsne_data[:, 0], tsne_data[:, 1], \n", " c=dr_suite.iris_target, cmap='viridis', alpha=0.7)\n", " axes[1, 0].set_title('t-SNE Reduction')\n", " axes[1, 0].set_xlabel('t-SNE 1')\n", " axes[1, 0].set_ylabel('t-SNE 2')\n", " \n", " # UMAP\n", " umap_data = dr_suite.results['iris_umap']['transformed_data']\n", " axes[1, 1].scatter(umap_data[:, 0], umap_data[:, 1], \n", " c=dr_suite.iris_target, cmap='viridis', alpha=0.7)\n", " axes[1, 1].set_title('UMAP Reduction')\n", " axes[1, 1].set_xlabel('UMAP 1')\n", " axes[1, 1].set_ylabel('UMAP 2')\n", " \n", " plt.tight_layout()\n", " plt.savefig('visualizations/iris_comparison.png', dpi=300, bbox_inches='tight')\n", " plt.close()\n", " \n", " # 3. 
Digits dataset visualization\n", " fig, axes = plt.subplots(2, 2, figsize=(15, 12))\n", " \n", " # Original digit (first sample only)\n", " axes[0, 0].imshow(dr_suite.digits_images[0], cmap='gray')\n", " axes[0, 0].set_title('Original Digit Image (8x8 pixels)')\n", " \n", " # PCA\n", " pca_data = dr_suite.results['digits_pca']['transformed_data']\n", " scatter = axes[0, 1].scatter(pca_data[:, 0], pca_data[:, 1], \n", " c=dr_suite.digits_target, cmap='tab10', alpha=0.7)\n", " axes[0, 1].set_title('PCA - Digits Dataset')\n", " axes[0, 1].set_xlabel('PC1')\n", " axes[0, 1].set_ylabel('PC2')\n", " \n", " # t-SNE\n", " tsne_data = dr_suite.results['digits_tsne']['transformed_data']\n", " axes[1, 0].scatter(tsne_data[:, 0], tsne_data[:, 1], \n", " c=dr_suite.digits_target, cmap='tab10', alpha=0.7)\n", " axes[1, 0].set_title('t-SNE - Digits Dataset')\n", " axes[1, 0].set_xlabel('t-SNE 1')\n", " axes[1, 0].set_ylabel('t-SNE 2')\n", " \n", " # UMAP\n", " umap_data = dr_suite.results['digits_umap']['transformed_data']\n", " axes[1, 1].scatter(umap_data[:, 0], umap_data[:, 1], \n", " c=dr_suite.digits_target, cmap='tab10', alpha=0.7)\n", " axes[1, 1].set_title('UMAP - Digits Dataset')\n", " axes[1, 1].set_xlabel('UMAP 1')\n", " axes[1, 1].set_ylabel('UMAP 2')\n", " \n", " plt.tight_layout()\n", " plt.savefig('visualizations/digits_comparison.png', dpi=300, bbox_inches='tight')\n", " plt.close()\n", " \n", " logging.info(\"All visualizations saved to visualizations/ directory\")\n", "\n", "def main():\n", " logging.info(\"Starting Dimensionality Reduction Suite\")\n", " \n", " # Initialize the suite\n", " dr_suite = DimensionalityReductionSuite()\n", " \n", " # Load and prepare data\n", " dr_suite.load_and_prepare_data()\n", " \n", " # Apply PCA\n", " logging.info(\"=== APPLYING PCA ===\")\n", " dr_suite.apply_pca(dr_suite.iris_scaled, 'iris', n_components=2)\n", " dr_suite.apply_pca(dr_suite.digits_scaled, 'digits', n_components=2)\n", " \n", " # 
Apply t-SNE\n", " logging.info(\"=== APPLYING t-SNE ===\")\n", " dr_suite.apply_tsne(dr_suite.iris_scaled, 'iris', perplexity=30)\n", " dr_suite.apply_tsne(dr_suite.digits_scaled, 'digits', perplexity=30)\n", " \n", " # Apply UMAP\n", " logging.info(\"=== APPLYING UMAP ===\")\n", " dr_suite.apply_umap(dr_suite.iris_scaled, 'iris', n_neighbors=15)\n", " dr_suite.apply_umap(dr_suite.digits_scaled, 'digits', n_neighbors=15)\n", " \n", " # Apply Autoencoder\n", " logging.info(\"=== APPLYING AUTOENCODER ===\")\n", " iris_encoded, iris_autoencoder, iris_losses = train_autoencoder(\n", " dr_suite.iris_scaled, 'iris', encoding_dim=2, epochs=50, lr=0.001\n", " )\n", " \n", " digits_encoded, digits_autoencoder, digits_losses = train_autoencoder(\n", " dr_suite.digits_scaled, 'digits', encoding_dim=10, epochs=100, lr=0.001\n", " )\n", " \n", " # Store autoencoder results\n", " dr_suite.results['iris_autoencoder'] = {\n", " 'transformed_data': iris_encoded,\n", " 'training_losses': iris_losses\n", " }\n", " \n", " dr_suite.results['digits_autoencoder'] = {\n", " 'transformed_data': digits_encoded,\n", " 'training_losses': digits_losses\n", " }\n", " \n", " # Evaluate all methods\n", " logging.info(\"=== EVALUATING METHODS ===\")\n", " evaluation_results = {}\n", " \n", " # Evaluate on Iris dataset\n", " methods = ['pca', 'tsne', 'umap']\n", " for method in methods:\n", " eval_result = evaluate_dimensionality_reduction(\n", " dr_suite.iris_scaled, \n", " dr_suite.results[f'iris_{method}']['transformed_data'],\n", " dr_suite.iris_target,\n", " 'iris',\n", " method.upper()\n", " )\n", " evaluation_results[f'iris_{method}'] = eval_result\n", " \n", " # Evaluate on Digits dataset\n", " for method in methods:\n", " eval_result = evaluate_dimensionality_reduction(\n", " dr_suite.digits_scaled,\n", " dr_suite.results[f'digits_{method}']['transformed_data'],\n", " dr_suite.digits_target,\n", " 'digits',\n", " method.upper()\n", " )\n", " evaluation_results[f'digits_{method}'] = 
eval_result\n", " \n", " # Create visualizations\n", " create_visualizations(dr_suite)\n", " \n", " # Save models\n", " logging.info(\"Saving trained models\")\n", " with open('models/pca_iris.pkl', 'wb') as f:\n", " pickle.dump(dr_suite.models['iris_pca'], f)\n", " \n", " with open('models/pca_digits.pkl', 'wb') as f:\n", " pickle.dump(dr_suite.models['digits_pca'], f)\n", " \n", " with open('models/umap_iris.pkl', 'wb') as f:\n", " pickle.dump(dr_suite.models['iris_umap'], f)\n", " \n", " with open('models/umap_digits.pkl', 'wb') as f:\n", " pickle.dump(dr_suite.models['digits_umap'], f)\n", " \n", " torch.save(iris_autoencoder.state_dict(), 'models/autoencoder_iris.pth')\n", " torch.save(digits_autoencoder.state_dict(), 'models/autoencoder_digits.pth')\n", " \n", " # Save results summary\n", " logging.info(\"Saving results summary\")\n", " results_summary = {\n", " 'timestamp': datetime.now().isoformat(),\n", " 'datasets': {\n", " 'iris': {\n", " 'original_features': dr_suite.iris_data.shape[1],\n", " 'samples': dr_suite.iris_data.shape[0],\n", " 'classes': len(np.unique(dr_suite.iris_target))\n", " },\n", " 'digits': {\n", " 'original_features': dr_suite.digits_data.shape[1],\n", " 'samples': dr_suite.digits_data.shape[0],\n", " 'classes': len(np.unique(dr_suite.digits_target))\n", " }\n", " },\n", " 'pca_explained_variance': {\n", " 'iris': dr_suite.results['iris_pca']['explained_variance'].tolist(),\n", " 'digits': dr_suite.results['digits_pca']['explained_variance'].tolist()\n", " },\n", " 'evaluation_results': evaluation_results,\n", " 'autoencoder_final_losses': {\n", " 'iris': iris_losses[-1],\n", " 'digits': digits_losses[-1]\n", " }\n", " }\n", " \n", " with open('results/dimensionality_reduction_summary.json', 'w') as f:\n", " json.dump(results_summary, f, indent=2)\n", " \n", " # Print final summary\n", " logging.info(\"=== FINAL SUMMARY ===\")\n", " logging.info(f\"Iris Dataset - PCA Explained Variance: 
{dr_suite.results['iris_pca']['explained_variance']}\")\n", " logging.info(f\"Digits Dataset - PCA Explained Variance: {dr_suite.results['digits_pca']['explained_variance']}\")\n", " \n", " for dataset in ['iris', 'digits']:\n", " logging.info(f\"\\n{dataset.upper()} Dataset Classification Performance:\")\n", " for method in ['pca', 'tsne', 'umap']:\n", " result = evaluation_results[f'{dataset}_{method}']\n", " logging.info(f\" {method.upper()}: {result['accuracy_retention']:.2f}% accuracy retention\")\n", " \n", " logging.info(\"\\nAll models saved to models/ directory\")\n", " logging.info(\"All results saved to results/ directory\")\n", " logging.info(\"All visualizations saved to visualizations/ directory\")\n", " logging.info(\"Dimensionality Reduction Suite completed successfully!\")\n", "\n", "if __name__ == \"__main__\":\n", " main()\n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "ea6dd258-7eed-4e21-b9a0-388dfd1fd622", "metadata": {}, "outputs": [], "source": [ "# Import all necessary libraries for dimensionality reduction analysis\n", "import numpy as np # Numerical computing foundation\n", "import pandas as pd # Data manipulation (though we use sklearn datasets directly)\n", "import matplotlib.pyplot as plt # Plotting library for static visualizations\n", "import seaborn as sns # Statistical plotting enhancements\n", "import plotly.express as px # Interactive plotting (not used but available)\n", "import plotly.graph_objects as go # More complex interactive plots\n", "from sklearn.datasets import load_iris, load_digits # Standard ML datasets\n", "from sklearn.preprocessing import StandardScaler # Feature scaling (critical for DR)\n", "from sklearn.decomposition import PCA # Principal Component Analysis\n", "from sklearn.manifold import TSNE # t-Distributed Stochastic Neighbor Embedding\n", "from sklearn.model_selection import train_test_split # Data splitting for evaluation\n", "from sklearn.ensemble import RandomForestClassifier # 
Robust classifier for evaluation\n", "from sklearn.metrics import accuracy_score, classification_report # Performance metrics\n", "import umap # Uniform Manifold Approximation and Projection\n", "import torch # PyTorch for neural network autoencoder\n", "import torch.nn as nn # Neural network modules\n", "import torch.optim as optim # Optimization algorithms\n", "import pickle # Model serialization for sklearn models\n", "import json # Results storage in human-readable format\n", "import logging # Comprehensive logging instead of print statements\n", "import os # Directory and file operations\n", "from datetime import datetime # Timestamps for results\n", "\n", "# Configure logging to both file and console\n", "# This replaces print statements and provides timestamps and log levels\n", "logging.basicConfig(\n", " level=logging.INFO, # Show INFO level and above\n", " format='%(asctime)s - %(levelname)s - %(message)s', # Include timestamp\n", " handlers=[\n", " logging.FileHandler('dimensionality_reduction.log'), # Save to file\n", " logging.StreamHandler() # Also display in console\n", " ]\n", ")\n", "\n", "# Create directories for organized output storage\n", "# exist_ok=True prevents errors if directories already exist\n", "os.makedirs('results', exist_ok=True) # Numerical results and summaries\n", "os.makedirs('models', exist_ok=True) # Trained models for reuse\n", "os.makedirs('visualizations', exist_ok=True) # Generated plots\n", "\n", "class DimensionalityReductionSuite:\n", " \"\"\"\n", " Main class to organize all dimensionality reduction experiments\n", " \n", " Design Choice: Using a class to maintain state and organize methods\n", " - Keeps related data and methods together\n", " - Allows easy access to results across different methods\n", " - Facilitates comparison and evaluation\n", " \"\"\"\n", " \n", " def __init__(self):\n", " \"\"\"Initialize storage for results and trained models\"\"\"\n", " self.results = {} # Store transformed data and 
metrics\n", " self.models = {} # Store trained models for reuse\n", " \n", " def load_and_prepare_data(self):\n", " \"\"\"\n", " Load standard datasets and prepare them for dimensionality reduction\n", " \n", " Dataset Choice Rationale:\n", " - Iris: Low-dimensional (4 features), well-separated classes, good for understanding\n", " - Digits: High-dimensional (64 features), more challenging, realistic scenario\n", " \"\"\"\n", " logging.info(\"Loading datasets for dimensionality reduction analysis\")\n", " \n", " # Load Iris dataset - classic 4D dataset with 3 flower species\n", " iris = load_iris()\n", " self.iris_data = iris.data # 150 samples × 4 features\n", " self.iris_target = iris.target # Class labels (0, 1, 2)\n", " self.iris_target_names = iris.target_names # ['setosa', 'versicolor', 'virginica']\n", " self.iris_feature_names = iris.feature_names # Sepal/petal length/width\n", " \n", " logging.info(f\"Iris dataset loaded: {self.iris_data.shape} features, {len(np.unique(self.iris_target))} classes\")\n", " \n", " # Load Digits dataset - 8×8 pixel images of handwritten digits (0-9)\n", " digits = load_digits()\n", " self.digits_data = digits.data # 1797 samples × 64 features (flattened 8×8 images)\n", " self.digits_target = digits.target # Digit labels (0-9)\n", " self.digits_images = digits.images # Original 8×8 image format for visualization\n", " \n", " logging.info(f\"Digits dataset loaded: {self.digits_data.shape} features, {len(np.unique(self.digits_target))} classes\")\n", " \n", " # CRITICAL: Standardize the data before applying dimensionality reduction\n", " # Why standardization is essential:\n", " # 1. Features have different scales (e.g., sepal length vs width)\n", " # 2. PCA is sensitive to feature scales - larger values dominate\n", " # 3. Distance-based methods (t-SNE, UMAP) need comparable scales\n", " # 4. 
Neural networks train better with normalized inputs\n", " \n", " self.scaler_iris = StandardScaler() # Create scaler for iris data\n", " # fit_transform: (1) calculates mean and std, (2) applies transformation\n", " self.iris_scaled = self.scaler_iris.fit_transform(self.iris_data)\n", " \n", " self.scaler_digits = StandardScaler() # Separate scaler for digits\n", " self.digits_scaled = self.scaler_digits.fit_transform(self.digits_data)\n", " \n", " logging.info(\"Data standardization completed\")\n", " \n", " def apply_pca(self, data, dataset_name, n_components=2):\n", " \"\"\"\n", " Apply Principal Component Analysis\n", " \n", " PCA finds linear combinations of original features that explain maximum variance\n", " \n", " Parameters:\n", " - data: Standardized input data\n", " - dataset_name: For organizing results\n", " - n_components: Number of dimensions to reduce to (2 for visualization)\n", " \n", " Design Choice: Using 2 components for easy visualization and comparison\n", " \"\"\"\n", " logging.info(f\"Applying PCA to {dataset_name} dataset\")\n", " \n", " # Create PCA object with specified number of components\n", " pca = PCA(n_components=n_components)\n", " \n", " # fit_transform: (1) finds principal components, (2) projects data\n", " data_pca = pca.fit_transform(data)\n", " \n", " # Extract variance information - crucial for understanding quality\n", " explained_variance = pca.explained_variance_ratio_ # Proportion of variance per component\n", " cumulative_variance = np.cumsum(explained_variance) # Running total of explained variance\n", " \n", " logging.info(f\"PCA completed for {dataset_name}\")\n", " logging.info(f\"Explained variance per component: {explained_variance}\")\n", " logging.info(f\"Cumulative explained variance: {cumulative_variance}\")\n", " \n", " # Store comprehensive results for later analysis\n", " self.results[f'{dataset_name}_pca'] = {\n", " 'transformed_data': data_pca, # Projected data points\n", " 'explained_variance': 
explained_variance, # How much variance each PC explains\n", " 'cumulative_variance': cumulative_variance, # Total variance captured\n", " 'components': pca.components_ # The actual principal components (directions)\n", " }\n", " \n", " # Store trained model for potential reuse (e.g., transforming new data)\n", " self.models[f'{dataset_name}_pca'] = pca\n", " \n", " return data_pca, explained_variance\n", " \n", " def apply_tsne(self, data, dataset_name, n_components=2, perplexity=30):\n", " \"\"\"\n", " Apply t-Distributed Stochastic Neighbor Embedding\n", " \n", " t-SNE preserves local neighborhood structure, excellent for visualization\n", " \n", " Key Parameters:\n", " - perplexity: Balance between local and global structure (typically 5-50)\n", " - n_components: Output dimensions (2 or 3 for visualization)\n", " \n", " Important: t-SNE is non-linear and stochastic; results differ between runs unless random_state is fixed\n", " \"\"\"\n", " logging.info(f\"Applying t-SNE to {dataset_name} dataset with perplexity={perplexity}\")\n", " \n", " # Create t-SNE object\n", " # random_state=42: Fixes the random initialization so runs are repeatable\n", " # perplexity=30: scikit-learn's default; it must be smaller than n_samples\n", " tsne = TSNE(n_components=n_components, perplexity=perplexity, random_state=42)\n", " \n", " # fit_transform: scikit-learn's TSNE has no separate transform method, unlike PCA\n", " # It optimizes the embedding directly from the data\n", " data_tsne = tsne.fit_transform(data)\n", " \n", " logging.info(f\"t-SNE completed for {dataset_name}\")\n", " # KL divergence: Lower is better, but only comparable between runs on the same data and perplexity\n", " logging.info(f\"Final KL divergence: {tsne.kl_divergence_}\")\n", " \n", " # Store results (note: no reusable model for t-SNE)\n", " self.results[f'{dataset_name}_tsne'] = {\n", " 'transformed_data': data_tsne,\n", " 'kl_divergence': tsne.kl_divergence_ # Quality metric\n", " }\n", " \n", " return data_tsne\n", " \n", " def apply_umap(self, data, dataset_name, n_components=2, n_neighbors=15):\n", 
" \"\"\"\n", " Apply Uniform Manifold Approximation and Projection\n", " \n", " UMAP preserves local structure and often retains more global structure than t-SNE\n", " \n", " Key Parameters:\n", " - n_neighbors: Size of local neighborhood (typically 5-50)\n", " - n_components: Output dimensions\n", " \n", " Advantage: A fitted UMAP model can transform new data (scikit-learn's TSNE cannot)\n", " \"\"\"\n", " logging.info(f\"Applying UMAP to {dataset_name} dataset with n_neighbors={n_neighbors}\")\n", " \n", " # Create UMAP reducer\n", " # n_neighbors=15: Smaller values emphasize local structure, larger values global structure\n", " # random_state=42: Reproducible results\n", " umap_reducer = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors, random_state=42)\n", " \n", " # fit_transform: UMAP learns mapping and applies it\n", " data_umap = umap_reducer.fit_transform(data)\n", " \n", " logging.info(f\"UMAP completed for {dataset_name}\")\n", " \n", " # Store results and model (UMAP can transform new data)\n", " self.results[f'{dataset_name}_umap'] = {\n", " 'transformed_data': data_umap\n", " }\n", " \n", " # Save model for potential reuse\n", " self.models[f'{dataset_name}_umap'] = umap_reducer\n", " \n", " return data_umap\n", "\n", "class SimpleAutoencoder(nn.Module):\n", " \"\"\"\n", " Neural network autoencoder for dimensionality reduction\n", " \n", " Architecture Design Rationale:\n", " - Encoder: input → 128 → 64 → encoding_dim (for the 4-feature iris data the hidden layers are wider than the input; the bottleneck is encoding_dim)\n", " - Decoder: Mirrors encoder in reverse (encoding_dim → 64 → 128 → input)\n", " - ReLU activations: Introduce non-linearity and mitigate vanishing gradients\n", " - No activation on final layer: Allows reconstruction of any real values\n", " \n", " Design Choice: Simple but effective architecture\n", " - Avoids overly complex models that might not converge\n", " - Sufficient capacity for the datasets used\n", " - Easy to understand and modify\n", " \"\"\"\n", " \n", " def __init__(self, input_dim, encoding_dim):\n", " \"\"\"\n", 
" Initialize autoencoder layers\n", " \n", " Parameters:\n", " - input_dim: Original feature count (4 for iris, 64 for digits)\n", " - encoding_dim: Compressed representation size\n", " \"\"\"\n", " super(SimpleAutoencoder, self).__init__()\n", " \n", " # Encoder: Compress input to lower dimensional representation\n", " self.encoder = nn.Sequential(\n", " nn.Linear(input_dim, 128), # First compression layer\n", " nn.ReLU(), # Non-linear activation\n", " nn.Linear(128, 64), # Second compression layer\n", " nn.ReLU(), # Non-linear activation\n", " nn.Linear(64, encoding_dim) # Final encoding layer (no activation)\n", " )\n", " \n", " # Decoder: Reconstruct original input from encoding\n", " self.decoder = nn.Sequential(\n", " nn.Linear(encoding_dim, 64), # Start expanding\n", " nn.ReLU(), # Non-linear activation\n", " nn.Linear(64, 128), # Continue expanding\n", " nn.ReLU(), # Non-linear activation\n", " nn.Linear(128, input_dim) # Final reconstruction (no activation)\n", " )\n", " \n", " def forward(self, x):\n", " \"\"\"\n", " Forward pass through autoencoder\n", " \n", " Returns both decoded output and encoded representation\n", " This allows us to use the encoded representation for dimensionality reduction\n", " \"\"\"\n", " encoded = self.encoder(x) # Compress input\n", " decoded = self.decoder(encoded) # Reconstruct from compression\n", " return decoded, encoded\n", "\n", "def train_autoencoder(data, dataset_name, encoding_dim=10, epochs=100, lr=0.001):\n", " \"\"\"\n", " Train autoencoder for dimensionality reduction\n", " \n", " Training Process:\n", " 1. Convert data to PyTorch tensors\n", " 2. Initialize model, loss function, and optimizer\n", " 3. Training loop: forward pass → loss calculation → backpropagation\n", " 4. 
Extract final encoded representations\n", " \n", " Hyperparameter Choices:\n", " - epochs=100: Sufficient for convergence on small datasets\n", " - lr=0.001: Conservative learning rate to avoid instability\n", " - Adam optimizer: Adaptive learning rate, good default choice\n", " - MSE loss: Appropriate for reconstruction tasks\n", " \"\"\"\n", " logging.info(f\"Training autoencoder for {dataset_name} dataset\")\n", " logging.info(f\"Input dimension: {data.shape[1]}, Encoding dimension: {encoding_dim}\")\n", " \n", " # Convert numpy array to PyTorch tensor\n", " # FloatTensor: Standard data type for neural networks\n", " data_tensor = torch.FloatTensor(data)\n", " \n", " # Initialize model with appropriate dimensions\n", " model = SimpleAutoencoder(data.shape[1], encoding_dim)\n", " \n", " # Loss function: Mean Squared Error for reconstruction\n", " # Measures average squared difference between input and reconstruction\n", " criterion = nn.MSELoss()\n", " \n", " # Optimizer: Adam with learning rate\n", " # Adam adapts learning rate per parameter, generally robust\n", " optimizer = optim.Adam(model.parameters(), lr=lr)\n", " \n", " # Track training progress\n", " losses = []\n", " \n", " # Training loop\n", " for epoch in range(epochs):\n", " # Reset gradients (PyTorch accumulates gradients by default)\n", " optimizer.zero_grad()\n", " \n", " # Forward pass: get reconstruction and encoding\n", " reconstructed, encoded = model(data_tensor)\n", " \n", " # Calculate reconstruction loss\n", " # Goal: minimize difference between input and reconstruction\n", " loss = criterion(reconstructed, data_tensor)\n", " \n", " # Backward pass: calculate gradients\n", " loss.backward()\n", " \n", " # Update model parameters\n", " optimizer.step()\n", " \n", " # Store loss for monitoring\n", " losses.append(loss.item())\n", " \n", " # Periodic logging to monitor training progress\n", " if (epoch + 1) % 20 == 0:\n", " logging.info(f\"Epoch {epoch+1}/{epochs}, Loss: 
{loss.item():.6f}\")\n", " \n", " # Extract final encoded representations for dimensionality reduction\n", " with torch.no_grad(): # Disable gradient computation for inference\n", " _, final_encoded = model(data_tensor)\n", " final_encoded = final_encoded.numpy() # Convert back to numpy\n", " \n", " logging.info(f\"Autoencoder training completed for {dataset_name}\")\n", " logging.info(f\"Final reconstruction loss: {losses[-1]:.6f}\")\n", " \n", " return final_encoded, model, losses\n", "\n", "def evaluate_dimensionality_reduction(original_data, reduced_data, target, dataset_name, method_name):\n", " \"\"\"\n", " Evaluate quality of dimensionality reduction using downstream classification\n", " \n", " Evaluation Strategy:\n", " 1. Train classifier on original high-dimensional data\n", " 2. Train classifier on reduced low-dimensional data\n", " 3. Compare classification accuracies\n", " 4. High accuracy retention indicates good dimensionality reduction\n", " \n", " Why This Evaluation Makes Sense:\n", " - Tests whether important information is preserved\n", " - Uses realistic downstream task (classification)\n", " - Provides interpretable metric (accuracy retention percentage)\n", " \n", " Caveat: reduced_data is expected to be an embedding fitted on all samples,\n", " test rows included, before this split. The reduced-data accuracy is therefore\n", " an optimistic estimate, not a strict held-out score.\n", " \"\"\"\n", " logging.info(f\"Evaluating {method_name} performance on {dataset_name} dataset\")\n", " \n", " # Split data consistently for fair comparison\n", " # stratify=target: Preserves class proportions in the train/test sets\n", " # random_state=42: Reproducible splits\n", " X_train_orig, X_test_orig, y_train, y_test = train_test_split(\n", " original_data, target, test_size=0.3, random_state=42, stratify=target\n", " )\n", " \n", " # Split reduced data identically (same random_state and stratify target, so rows line up)\n", " X_train_red, X_test_red, _, _ = train_test_split(\n", " reduced_data, target, test_size=0.3, random_state=42, stratify=target\n", " )\n", " \n", " # Train Random Forest classifiers\n", " # Random Forest Choice: Robust, handles different feature types well, good baseline\n", " rf_orig 
= RandomForestClassifier(random_state=42) # For original data\n", " rf_red = RandomForestClassifier(random_state=42) # For reduced data\n", " \n", " # Train both classifiers\n", " rf_orig.fit(X_train_orig, y_train)\n", " rf_red.fit(X_train_red, y_train)\n", " \n", " # Evaluate performance\n", " acc_orig = accuracy_score(y_test, rf_orig.predict(X_test_orig))\n", " acc_red = accuracy_score(y_test, rf_red.predict(X_test_red))\n", " \n", " # Log results with clear interpretation\n", " logging.info(f\"Original data accuracy: {acc_orig:.4f}\")\n", " logging.info(f\"Reduced data accuracy: {acc_red:.4f}\")\n", " logging.info(f\"Accuracy retention: {(acc_red/acc_orig)*100:.2f}%\")\n", " \n", " # Return structured results\n", " return {\n", " 'original_accuracy': acc_orig,\n", " 'reduced_accuracy': acc_red,\n", " 'accuracy_retention': (acc_red/acc_orig)*100 # Key metric for comparison\n", " }\n", "\n", "def create_visualizations(dr_suite):\n", " \"\"\"\n", " Generate comprehensive visualizations comparing all methods\n", " \n", " Visualization Strategy:\n", " 1. PCA explained variance plots - understand information retention\n", " 2. Side-by-side method comparisons - visual quality assessment\n", " 3. Dataset-specific plots - accommodate different characteristics\n", " \n", " Design Choices:\n", " - High DPI (300) for publication quality\n", " - Consistent color schemes for easy comparison\n", " - Clear titles and labels for interpretation\n", " \"\"\"\n", " logging.info(\"Creating comprehensive visualizations\")\n", " \n", " # 1. 
PCA Explained Variance Analysis\n",
"    # Shows how much information each principal component captures\n",
"    plt.figure(figsize=(12, 5))\n",
"\n",
"    # Iris dataset explained variance\n",
"    plt.subplot(1, 2, 1)\n",
"    iris_pca_var = dr_suite.results['iris_pca']['explained_variance']\n",
"    plt.bar(range(1, len(iris_pca_var)+1), iris_pca_var)\n",
"    plt.title('Iris Dataset - PCA Explained Variance')\n",
"    plt.xlabel('Principal Component')\n",
"    plt.ylabel('Explained Variance Ratio')\n",
"    # Add percentage labels on bars for clarity\n",
"    for i, v in enumerate(iris_pca_var):\n",
"        plt.text(i+1, v + 0.01, f'{v:.1%}', ha='center')\n",
"\n",
"    # Digits dataset explained variance\n",
"    plt.subplot(1, 2, 2)\n",
"    digits_pca_var = dr_suite.results['digits_pca']['explained_variance']\n",
"    plt.bar(range(1, len(digits_pca_var)+1), digits_pca_var)\n",
"    plt.title('Digits Dataset - PCA Explained Variance')\n",
"    plt.xlabel('Principal Component')\n",
"    plt.ylabel('Explained Variance Ratio')\n",
"    # Add percentage labels on bars\n",
"    for i, v in enumerate(digits_pca_var):\n",
"        plt.text(i+1, v + 0.002, f'{v:.1%}', ha='center')\n",
"\n",
"    plt.tight_layout()\n",
"    plt.savefig('visualizations/pca_explained_variance.png', dpi=300, bbox_inches='tight')\n",
"    plt.close()  # Close figure to free memory\n",
"\n",
"    # 2.
Iris Dataset Method Comparison\n",
"    # 2×2 grid showing different dimensionality reduction results\n",
"    fig, axes = plt.subplots(2, 2, figsize=(15, 12))\n",
"\n",
"    # Original data visualization (using first 2 features)\n",
"    axes[0, 0].scatter(dr_suite.iris_data[:, 0], dr_suite.iris_data[:, 1],\n",
"                       c=dr_suite.iris_target, cmap='viridis', alpha=0.7)\n",
"    axes[0, 0].set_title('Original Data (First 2 Features)')\n",
"    axes[0, 0].set_xlabel('Sepal Length')\n",
"    axes[0, 0].set_ylabel('Sepal Width')\n",
"    # Add colorbar to show class mapping\n",
"\n",
"    # PCA results\n",
"    pca_data = dr_suite.results['iris_pca']['transformed_data']\n",
"    scatter1 = axes[0, 1].scatter(pca_data[:, 0], pca_data[:, 1],\n",
"                                  c=dr_suite.iris_target, cmap='viridis', alpha=0.7)\n",
"    axes[0, 1].set_title('PCA Reduction')\n",
"    axes[0, 1].set_xlabel('PC1')\n",
"    axes[0, 1].set_ylabel('PC2')\n",
"\n",
"    # t-SNE results\n",
"    tsne_data = dr_suite.results['iris_tsne']['transformed_data']\n",
"    axes[1, 0].scatter(tsne_data[:, 0], tsne_data[:, 1],\n",
"                       c=dr_suite.iris_target, cmap='viridis', alpha=0.7)\n",
"    axes[1, 0].set_title('t-SNE Reduction')\n",
"    axes[1, 0].set_xlabel('t-SNE 1')\n",
"    axes[1, 0].set_ylabel('t-SNE 2')\n",
"\n",
"    # UMAP results\n",
"    umap_data = dr_suite.results['iris_umap']['transformed_data']\n",
"    axes[1, 1].scatter(umap_data[:, 0], umap_data[:, 1],\n",
"                       c=dr_suite.iris_target, cmap='viridis', alpha=0.7)\n",
"    axes[1, 1].set_title('UMAP Reduction')\n",
"    axes[1, 1].set_xlabel('UMAP 1')\n",
"    axes[1, 1].set_ylabel('UMAP 2')\n",
"\n",
"    plt.tight_layout()\n",
"    plt.savefig('visualizations/iris_comparison.png', dpi=300, bbox_inches='tight')\n",
"    plt.close()\n",
"\n",
"    # 3.
Digits Dataset Visualization\n",
"    # More challenging due to higher dimensionality and more classes\n",
"    fig, axes = plt.subplots(2, 2, figsize=(15, 12))\n",
"\n",
"    # Show sample original digit\n",
"    axes[0, 0].imshow(dr_suite.digits_images[0], cmap='gray')\n",
"    axes[0, 0].set_title('Original Digit Images (8×8 pixels)')\n",
"    axes[0, 0].axis('off')  # Remove axes for cleaner image display\n",
"\n",
"    # PCA results for digits\n",
"    pca_data = dr_suite.results['digits_pca']['transformed_data']\n",
"    scatter2 = axes[0, 1].scatter(pca_data[:, 0], pca_data[:, 1],\n",
"                                  c=dr_suite.digits_target, cmap='tab10', alpha=0.7)\n",
"    axes[0, 1].set_title('PCA - Digits Dataset')\n",
"    axes[0, 1].set_xlabel('PC1')\n",
"    axes[0, 1].set_ylabel('PC2')\n",
"\n",
"    # t-SNE results for digits\n",
"    tsne_data = dr_suite.results['digits_tsne']['transformed_data']\n",
"    axes[1, 0].scatter(tsne_data[:, 0], tsne_data[:, 1],\n",
"                       c=dr_suite.digits_target, cmap='tab10', alpha=0.7)\n",
"    axes[1, 0].set_title('t-SNE - Digits Dataset')\n",
"    axes[1, 0].set_xlabel('t-SNE 1')\n",
"    axes[1, 0].set_ylabel('t-SNE 2')\n",
"\n",
"    # UMAP results for digits\n",
"    umap_data = dr_suite.results['digits_umap']['transformed_data']\n",
"    axes[1, 1].scatter(umap_data[:, 0], umap_data[:, 1],\n",
"                       c=dr_suite.digits_target, cmap='tab10', alpha=0.7)\n",
"    axes[1, 1].set_title('UMAP - Digits Dataset')\n",
"    axes[1, 1].set_xlabel('UMAP 1')\n",
"    axes[1, 1].set_ylabel('UMAP 2')\n",
"\n",
"    plt.tight_layout()\n",
"    plt.savefig('visualizations/digits_comparison.png', dpi=300, bbox_inches='tight')\n",
"    plt.close()\n",
"\n",
"    logging.info(\"All visualizations saved to visualizations/ directory\")\n",
"\n",
"def main():\n",
"    \"\"\"\n",
"    Main execution function that orchestrates the entire analysis\n",
"\n",
"    Execution Flow:\n",
"    1. Initialize suite and load data\n",
"    2. Apply all dimensionality reduction methods\n",
"    3. Evaluate performance using classification\n",
"    4.
Generate visualizations\n",
"    5. Save models and results\n",
"    6. Provide comprehensive summary\n",
"\n",
"    Design Choice: Structured workflow ensures reproducibility and completeness\n",
"    \"\"\"\n",
"    logging.info(\"Starting Dimensionality Reduction Suite\")\n",
"\n",
"    # Initialize the comprehensive suite\n",
"    dr_suite = DimensionalityReductionSuite()\n",
"\n",
"    # Step 1: Data preparation\n",
"    dr_suite.load_and_prepare_data()\n",
"\n",
"    # Step 2: Apply linear method (PCA)\n",
"    logging.info(\"=== APPLYING PCA ===\")\n",
"    # Apply to both datasets with 2 components for comparison\n",
"    dr_suite.apply_pca(dr_suite.iris_scaled, 'iris', n_components=2)\n",
"    dr_suite.apply_pca(dr_suite.digits_scaled, 'digits', n_components=2)\n",
"\n",
"    # Step 3: Apply non-linear manifold learning (t-SNE)\n",
"    logging.info(\"=== APPLYING t-SNE ===\")\n",
"    # Use consistent parameters across datasets\n",
"    dr_suite.apply_tsne(dr_suite.iris_scaled, 'iris', perplexity=30)\n",
"    dr_suite.apply_tsne(dr_suite.digits_scaled, 'digits', perplexity=30)\n",
"\n",
"    # Step 4: Apply modern manifold learning (UMAP)\n",
"    logging.info(\"=== APPLYING UMAP ===\")\n",
"    # UMAP often provides good balance of local and global structure\n",
"    dr_suite.apply_umap(dr_suite.iris_scaled, 'iris', n_neighbors=15)\n",
"    dr_suite.apply_umap(dr_suite.digits_scaled, 'digits', n_neighbors=15)\n",
"\n",
"    # Step 5: Apply neural network approach (Autoencoder)\n",
"    logging.info(\"=== APPLYING AUTOENCODER ===\")\n",
"    # Different encoding dimensions based on dataset complexity\n",
"    iris_encoded, iris_autoencoder, iris_losses = train_autoencoder(\n",
"        dr_suite.iris_scaled, 'iris', encoding_dim=2, epochs=50, lr=0.001\n",
"    )\n",
"\n",
"    digits_encoded, digits_autoencoder, digits_losses = train_autoencoder(\n",
"        dr_suite.digits_scaled, 'digits', encoding_dim=10, epochs=100, lr=0.001\n",
"    )\n",
"\n",
"    # Store autoencoder results in consistent format\n",
"    dr_suite.results['iris_autoencoder'] = {\n",
"        
'transformed_data': iris_encoded,\n",
"        'training_losses': iris_losses\n",
"    }\n",
"\n",
"    dr_suite.results['digits_autoencoder'] = {\n",
"        'transformed_data': digits_encoded,\n",
"        'training_losses': digits_losses\n",
"    }\n",
"\n",
"    # Step 6: Comprehensive evaluation\n",
"    logging.info(\"=== EVALUATING METHODS ===\")\n",
"    evaluation_results = {}\n",
"\n",
"    # Evaluate traditional methods on both datasets\n",
"    methods = ['pca', 'tsne', 'umap']  # Methods that work with 2D output\n",
"\n",
"    # Iris dataset evaluation\n",
"    for method in methods:\n",
"        eval_result = evaluate_dimensionality_reduction(\n",
"            dr_suite.iris_scaled,  # Original standardized data\n",
"            dr_suite.results[f'iris_{method}']['transformed_data'],  # Reduced data\n",
"            dr_suite.iris_target,  # Class labels for classification\n",
"            'iris',  # Dataset name\n",
"            method.upper()  # Method name for logging\n",
"        )\n",
"        evaluation_results[f'iris_{method}'] = eval_result\n",
"\n",
"    # Digits dataset evaluation\n",
"    for method in methods:\n",
"        eval_result = evaluate_dimensionality_reduction(\n",
"            dr_suite.digits_scaled,\n",
"            dr_suite.results[f'digits_{method}']['transformed_data'],\n",
"            dr_suite.digits_target,\n",
"            'digits',\n",
"            method.upper()\n",
"        )\n",
"        evaluation_results[f'digits_{method}'] = eval_result\n",
"\n",
"    # Step 7: Generate comprehensive visualizations\n",
"    create_visualizations(dr_suite)\n",
"\n",
"    # Step 8: Save all trained models for future use\n",
"    logging.info(\"Saving trained models\")\n",
"\n",
"    # Save sklearn models using pickle (standard approach)\n",
"    with open('models/pca_iris.pkl', 'wb') as f:\n",
"        pickle.dump(dr_suite.models['iris_pca'], f)\n",
"\n",
"    with open('models/pca_digits.pkl', 'wb') as f:\n",
"        pickle.dump(dr_suite.models['digits_pca'], f)\n",
"\n",
"    with open('models/umap_iris.pkl', 'wb') as f:\n",
"        pickle.dump(dr_suite.models['iris_umap'], f)\n",
"\n",
"    with open('models/umap_digits.pkl', 'wb') as f:\n",
"        
pickle.dump(dr_suite.models['digits_umap'], f)\n",
"\n",
"    # Save PyTorch models using torch.save (state dictionaries)\n",
"    torch.save(iris_autoencoder.state_dict(), 'models/autoencoder_iris.pth')\n",
"    torch.save(digits_autoencoder.state_dict(), 'models/autoencoder_digits.pth')\n",
"\n",
"    # Step 9: Create comprehensive results summary\n",
"    logging.info(\"Saving results summary\")\n",
"    results_summary = {\n",
"        'timestamp': datetime.now().isoformat(),  # When analysis was run\n",
"        'datasets': {\n",
"            'iris': {\n",
"                'original_features': dr_suite.iris_data.shape[1],\n",
"                'samples': dr_suite.iris_data.shape[0],\n",
"                'classes': len(np.unique(dr_suite.iris_target))\n",
"            },\n",
"            'digits': {\n",
"                'original_features': dr_suite.digits_data.shape[1],\n",
"                'samples': dr_suite.digits_data.shape[0],\n",
"                'classes': len(np.unique(dr_suite.digits_target))\n",
"            }\n",
"        },\n",
"        # PCA explained variance is crucial for understanding information retention\n",
"        'pca_explained_variance': {\n",
"            'iris': dr_suite.results['iris_pca']['explained_variance'].tolist(),\n",
"            'digits': dr_suite.results['digits_pca']['explained_variance'].tolist()\n",
"        },\n",
"        # Classification performance comparison across all methods\n",
"        'evaluation_results': evaluation_results,\n",
"        # Autoencoder training convergence metrics\n",
"        'autoencoder_final_losses': {\n",
"            'iris': iris_losses[-1],  # Final reconstruction loss for iris\n",
"            'digits': digits_losses[-1]  # Final reconstruction loss for digits\n",
"        }\n",
"    }\n",
"\n",
"    # Save as JSON for easy reading and further analysis\n",
"    with open('results/dimensionality_reduction_summary.json', 'w') as f:\n",
"        json.dump(results_summary, f, indent=2)  # indent=2 for readability\n",
"\n",
"    # Step 10: Print comprehensive summary to console and log\n",
"    logging.info(\"=== FINAL SUMMARY ===\")\n",
"\n",
"    # PCA explained variance summary\n",
"    logging.info(f\"Iris Dataset - PCA Explained Variance:
{dr_suite.results['iris_pca']['explained_variance']}\")\n",
"    logging.info(f\"Digits Dataset - PCA Explained Variance: {dr_suite.results['digits_pca']['explained_variance']}\")\n",
"\n",
"    # Classification performance summary for easy comparison\n",
"    for dataset in ['iris', 'digits']:\n",
"        logging.info(f\"\\n{dataset.upper()} Dataset Classification Performance:\")\n",
"        for method in ['pca', 'tsne', 'umap']:\n",
"            result = evaluation_results[f'{dataset}_{method}']\n",
"            logging.info(f\"  {method.upper()}: {result['accuracy_retention']:.2f}% accuracy retention\")\n",
"\n",
"    # Final status messages\n",
"    logging.info(\"\\nAll models saved to models/ directory\")\n",
"    logging.info(\"All results saved to results/ directory\")\n",
"    logging.info(\"All visualizations saved to visualizations/ directory\")\n",
"    logging.info(\"Dimensionality Reduction Suite completed successfully!\")\n",
"\n",
"# Execute the main function when script is run directly\n",
"if __name__ == \"__main__\":\n",
"    main()"
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }