# DIMENSIONALITY REDUCTION

--------------------------------------------
PHASE 1: EXPLAIN & BREAKDOWN (LEARNING PHASE)
--------------------------------------------

## 1. Simple Explanation

Dimensionality reduction is like taking a 3D object and creating a 2D shadow that preserves the most important information. Imagine you have a dataset with 1000 features (columns) describing each data point, but many features are redundant or noisy. Dimensionality reduction techniques help you compress this data into fewer dimensions (maybe 10-50) while keeping the essential patterns intact.

Think of it like summarizing a 500-page book into a 20-page summary - you lose some details, but the main ideas remain. This matters in AI because high-dimensional data is hard to visualize, slow to process, and prone to the "curse of dimensionality": as the number of dimensions grows, data points become sparse and the distances between them become less informative, so many algorithms degrade. Common techniques include PCA (Principal Component Analysis), t-SNE, and autoencoders. Typical uses include image compression, data visualization, noise reduction, and preparing data for machine learning models.

## 2. Detailed Roadmap with Concrete Examples

**Step 1: Understanding the Problem**
- **Curse of Dimensionality**: Example - Finding nearest neighbors in 2D vs 1000D space
- **Computational Complexity**: Example - Processing 28×28 pixel images (784 features) vs 10 compressed features
- **Visualization Challenges**: Example - Plotting customer data with 50 attributes
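
The nearest-neighbor example can be made concrete with a small experiment. This is a minimal sketch, assuming uniformly random points and an arbitrary seed (neither comes from the text above): it compares how much farther the farthest point is than the nearest one in 2D and in 1000D.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for this sketch


def distance_contrast(n_points, n_dims):
    """Ratio of the farthest to the nearest distance from one query point."""
    points = rng.random((n_points, n_dims))
    query, others = points[0], points[1:]
    dists = np.linalg.norm(others - query, axis=1)
    return dists.max() / dists.min()


contrast_2d = distance_contrast(500, 2)
contrast_1000d = distance_contrast(500, 1000)
print(f"2D:    farthest / nearest = {contrast_2d:.1f}")
print(f"1000D: farthest / nearest = {contrast_1000d:.2f}")
```

In high dimensions the ratio moves close to 1, which is why "nearest" neighbors stop being meaningfully nearer than everything else.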

**Step 2: Linear Dimensionality Reduction**
- **Principal Component Analysis (PCA)**: Example - Reducing face images from 10,000 pixels to 100 principal components
- **Linear Discriminant Analysis (LDA)**: Example - Separating iris flower species using 2 components instead of 4 features
- **Factor Analysis**: Example - Finding underlying factors in psychological test scores

**Step 3: Non-Linear Dimensionality Reduction**
- **t-SNE**: Example - Visualizing high-dimensional word embeddings in 2D scatter plots
- **UMAP**: Example - Exploring single-cell RNA sequencing data clusters
- **Isomap**: Example - Unfolding Swiss roll dataset to reveal underlying 2D structure

**Step 4: Neural Network Approaches**
- **Autoencoders**: Example - Compressing MNIST digit images from 784 to 32 dimensions
- **Variational Autoencoders (VAE)**: Example - Generating new faces by sampling from learned latent space
- **Deep Feature Learning**: Example - Using CNN layers as feature extractors

**Step 5: Evaluation and Selection**
- **Explained Variance**: Example - Choosing number of PCA components to retain 95% variance
- **Reconstruction Error**: Example - Measuring how well compressed images match originals
- **Downstream Task Performance**: Example - Classification accuracy after dimensionality reduction
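
The first two evaluation criteria can be sketched together with scikit-learn. The choice of the digits dataset and the 95% threshold below follow the examples in this roadmap; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

# Fit with all components, then find the smallest k whose cumulative
# explained variance ratio reaches 95%.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
k_95 = int(np.searchsorted(cumulative, 0.95) + 1)

# Reconstruction error for that k: project down, map back, compare to input.
pca_k = PCA(n_components=k_95).fit(X)
X_restored = pca_k.inverse_transform(pca_k.transform(X))
mse = float(np.mean((X - X_restored) ** 2))
print(f"components for 95% variance: {k_95}, reconstruction MSE: {mse:.4f}")
```

scikit-learn also accepts a fraction directly, as in `PCA(n_components=0.95)`, which selects the number of components by the same cumulative-variance rule.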

## 3. Formula Memory Aids Section

### PCA Covariance Matrix Formula
**FORMULA**: C = (1/n) × X^T × X (X must be mean-centered; libraries such as scikit-learn divide by n-1 instead of n)

**REAL-LIFE ANALOGY**: "Which personality traits tend to rise and fall together across your friends?"
- C = Trait-by-trait co-variation table (one row and one column per trait)
- X = Each friend's personality traits (rows=friends, columns=traits), with each trait's average subtracted
- X^T = Flipping the friend-trait table
- 1/n = Averaging across all your friends

**MEMORY TRICK**: "Co-variance = varying together - how features dance together!"
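
As a quick check of the covariance formula, the sketch below computes C by hand on a made-up 5×3 table (the numbers are hypothetical) and compares it with NumPy's `np.cov`.

```python
import numpy as np

# Hypothetical toy data: 5 samples (rows) x 3 features (columns)
X_raw = np.array([[2.0, 1.0, 0.5],
                  [3.0, 2.5, 0.1],
                  [4.0, 2.0, 0.9],
                  [5.0, 4.5, 0.3],
                  [6.0, 5.0, 0.7]])

n = X_raw.shape[0]
X = X_raw - X_raw.mean(axis=0)  # center each column first
C = (X.T @ X) / n               # 1/n normalisation, as in the formula

# np.cov treats rows as variables by default, hence rowvar=False;
# bias=True selects 1/n instead of the default 1/(n-1).
C_numpy = np.cov(X_raw, rowvar=False, bias=True)
print(np.allclose(C, C_numpy))
```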

### PCA Eigenvalue Decomposition Formula
**FORMULA**: C × v = λ × v

**REAL-LIFE ANALOGY**: "Which direction does your friend group naturally lean?"
- C = Covariance matrix (how the group's traits vary together)
- v = Direction of strongest group tendency (eigenvector)
- λ = How strong that tendency is (eigenvalue)
- The equation means: "Group tendency × Direction = Strength × Same Direction"

**MEMORY TRICK**: "Eigen = 'Own' in German - finding data's 'own' natural directions!"
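
The eigenvalue equation can be verified numerically. A minimal sketch, assuming a small hypothetical symmetric matrix in place of a real covariance matrix:

```python
import numpy as np

C = np.array([[4.0, 2.0],
              [2.0, 3.0]])  # hypothetical symmetric covariance matrix

# eigh is intended for symmetric matrices; eigenvalues come back ascending
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]   # largest eigenvalue first, as in PCA
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # columns are the eigenvectors

for lam, v in zip(eigenvalues, eigenvectors.T):
    residual = C @ v - lam * v          # should be (numerically) zero
    print(f"lambda = {lam:.3f}, max |C v - lambda v| = {np.abs(residual).max():.2e}")
```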

### Explained Variance Ratio Formula
**FORMULA**: Explained Variance = λᵢ / Σλⱼ

**REAL-LIFE ANALOGY**: "What percentage of your friend group's energy goes into sports vs studies?"
- λᵢ = Energy spent on sports (one eigenvalue)
- Σλⱼ = Total energy of the group (sum of all eigenvalues)
- Ratio = Sports energy / Total energy

**MEMORY TRICK**: "Explained = Ex-plained on a plane - how much info fits on each dimension!"
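
In code, the ratio is one division. The eigenvalues below are made up for illustration and assumed to be sorted already:

```python
import numpy as np

eigenvalues = np.array([4.2, 2.4, 0.9, 0.5])  # hypothetical, sorted descending

ratio = eigenvalues / eigenvalues.sum()   # share of variance per component
cumulative = np.cumsum(ratio)             # running total

for i, (r, c) in enumerate(zip(ratio, cumulative), start=1):
    print(f"PC{i}: {r:.1%} of variance, {c:.1%} cumulative")
```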

### t-SNE Similarity Formula
**FORMULA**: pⱼ|ᵢ = exp(-||xᵢ - xⱼ||² / (2σᵢ²)) / Σₖ≠ᵢ exp(-||xᵢ - xₖ||² / (2σᵢ²))

**REAL-LIFE ANALOGY**: "How likely is person i to pick person j as a neighbor in a crowded room?"
- pⱼ|ᵢ = Probability that person i picks person j as a neighbor (t-SNE then symmetrizes: pᵢⱼ = (pⱼ|ᵢ + pᵢ|ⱼ) / 2n)
- ||xᵢ - xⱼ||² = How different their personalities are (squared distance)
- σᵢ² = How wide person i's social circle is (bandwidth, set from the perplexity; smaller = pickier)
- exp(-distance² / (2 × bandwidth²)) = Friendship probability falls off quickly as distance grows relative to the bandwidth

**MEMORY TRICK**: "t-SNE = t-See Neighbors Everywhere - finding similar points!"
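
The similarity formula can be sketched in NumPy. This is an illustration only: it uses a fixed bandwidth of 1 for every point, whereas t-SNE itself searches for each σᵢ so that the neighbor distribution matches the chosen perplexity. The function name and the toy points are hypothetical.

```python
import numpy as np


def conditional_probabilities(X, sigma):
    """Row i holds the neighbor probabilities of point i (one sigma per point)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
sigma = np.ones(len(X))  # fixed bandwidths: an assumption of this sketch
P = conditional_probabilities(X, sigma)
print(P.round(3))
```

Each row sums to 1, and the far-away fourth point receives almost no probability from the first three.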

## 4. Step-by-Step Numerical Example (PCA on 2D data)

**Dataset**: 4 points in 2D space
```
Point 1: (1, 2)
Point 2: (3, 4) 
Point 3: (5, 6)
Point 4: (7, 8)
```

**Step 1: Center the data (subtract mean)**
```
Mean = (4, 5)
Centered data:
Point 1: (-3, -3)
Point 2: (-1, -1)
Point 3: (1, 1)
Point 4: (3, 3)
```

**Step 2: Calculate covariance matrix**
```
X = [[-3, -3],
     [-1, -1],
     [ 1,  1],
     [ 3,  3]]

C = (1/4) × X^T × X
  = (1/4) × [[20, 20],
             [20, 20]]
  = [[5, 5],
     [5, 5]]
```

**Step 3: Find eigenvalues and eigenvectors**
```
Characteristic equation: det(C - λI) = 0
(5-λ)² - 25 = 0
λ² - 10λ = 0
λ₁ = 10, λ₂ = 0

Eigenvector for λ₁ = 10: v₁ = [1/√2, 1/√2]
Eigenvector for λ₂ = 0: v₂ = [1/√2, -1/√2]
```

**Step 4: Project data onto first principal component**
```
PC1 = X × v₁ = [[-3, -3], [-1, -1], [1, 1], [3, 3]] × [1/√2, 1/√2]
    = [-6/√2, -2/√2, 2/√2, 6/√2]
    = [-4.24, -1.41, 1.41, 4.24]
```

**Result**: 2D data reduced to 1D with 100% explained variance (λ₁ / (λ₁ + λ₂) = 10 / 10), because all four points lie exactly on one line.
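
The hand calculation can be reproduced in a few lines of NumPy. The sign flip is needed only because an eigenvector's sign is arbitrary; it picks the orientation used in Step 3.

```python
import numpy as np

points = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=float)

X = points - points.mean(axis=0)   # Step 1: center
C = (X.T @ X) / len(X)             # Step 2: covariance with 1/n

eigenvalues, eigenvectors = np.linalg.eigh(C)  # Step 3: ascending eigenvalues
v1 = eigenvectors[:, -1]           # eigenvector of the largest eigenvalue
if v1[0] < 0:
    v1 = -v1                       # fix the arbitrary sign

pc1 = X @ v1                       # Step 4: project onto PC1
explained = eigenvalues[-1] / eigenvalues.sum()

print("C =", C.tolist())
print("eigenvalues =", eigenvalues.round(6).tolist())
print("PC1 =", pc1.round(2).tolist())
print("explained variance ratio =", round(float(explained), 6))
```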

## 5. Real-World AI Use Case

**Netflix Recommendation System**:
Netflix has millions of users and thousands of movies, creating a massive, mostly empty user-movie rating matrix. Matrix factorization (a form of dimensionality reduction, popularized by the Netflix Prize competition) can be used to:

1. **Compress user preferences**: Reduce each user's 10,000+ movie ratings to ~50 latent factors (like "action lover", "comedy fan", "indie preference")
2. **Compress movie features**: Reduce each movie's characteristics to the same 50 factors
3. **Make predictions**: Multiply user factors × movie factors to predict ratings
4. **Handle sparsity**: Most users haven't rated most movies, but the compressed representation can still make predictions

This reduces storage, speeds up computation, and reveals hidden patterns like "users who like sci-fi also tend to like thrillers."
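
A toy version of this idea fits in a few lines of NumPy. It is a sketch under stated assumptions, not Netflix's actual system: the 5×4 rating matrix, the two latent factors, and the learning-rate and regularization values are all made up, and plain gradient descent is used only because it is easy to read.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5 users x 4 movies rating matrix; 0 marks "not rated"
R = np.array([[5, 4, 0, 1],
              [4, 0, 1, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4],
              [5, 5, 1, 0]], dtype=float)
observed = R > 0

k = 2                                            # latent factors (assumed)
U = 0.1 * rng.standard_normal((R.shape[0], k))   # user factors
V = 0.1 * rng.standard_normal((R.shape[1], k))   # movie factors
lr, reg = 0.01, 0.02                             # illustrative hyperparameters


def observed_rmse():
    """Root-mean-square error on the rated cells only."""
    err = (R - U @ V.T)[observed]
    return float(np.sqrt(np.mean(err ** 2)))


rmse_before = observed_rmse()
for _ in range(2000):
    err = np.where(observed, R - U @ V.T, 0.0)   # unrated cells contribute nothing
    U_step = err @ V - reg * U
    V_step = err.T @ U - reg * V
    U += lr * U_step
    V += lr * V_step
rmse_after = observed_rmse()

predictions = U @ V.T   # also contains estimates for the unrated cells
print(f"RMSE on rated cells: {rmse_before:.2f} -> {rmse_after:.2f}")
print(predictions.round(1))
```

Each user and each movie is now described by `k` numbers, and the product of the two factor matrices fills in the cells that were never rated.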

## 6. Tips for Mastering This Topic

**Practice Sources**:
- Scikit-learn documentation and examples
- Kaggle datasets (Iris, Wine, Breast Cancer for beginners)
- Andrew Ng's CS229 Stanford lectures on PCA
- Sebastian Raschka's "Python Machine Learning" book

**Hands-on Projects**:
1. **Visualize high-dimensional data**: Use t-SNE on MNIST digits
2. **Image compression**: Apply PCA to face images
3. **Feature extraction**: Compare PCA components vs original features for classification
4. **Clustering**: Use dimensionality reduction before K-means

**Key Resources**:
- **Theory**: "Elements of Statistical Learning" (Hastie, Tibshirani, Friedman)
- **Implementation**: Scikit-learn user guide on decomposition
- **Visualization**: Matplotlib and Plotly for 2D/3D scatter plots
- **Practice**: Coursera ML course assignments

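**Common Pitfalls to Avoid**:
- Don't apply PCA directly to categorical variables; it assumes numeric features where variance is meaningful
- Standardize features before PCA when they are on different scales; otherwise the largest-scale features dominate the components
- PCA assumes centered data: scikit-learn's `PCA` subtracts the mean for you, but a manual implementation must center first
- Choose the number of components from explained variance (or downstream performance), not an arbitrary number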
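
One way to respect the scaling and centering advice, and to avoid leaking test data into the preprocessing, is to chain the steps in a scikit-learn pipeline. This is a sketch on the Iris dataset; the classifier and split settings are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The scaler statistics and the principal components are learned
# from X_train only; X_test is merely transformed with them.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    RandomForestClassifier(random_state=42),
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy with 2 principal components: {test_accuracy:.3f}")
```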


In [2]:
!pip install numpy pandas scikit-learn matplotlib seaborn plotly umap-learn
!pip install torch torchvision # For autoencoder implementation

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from sklearn.datasets import load_iris, load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import umap
import torch
import torch.nn as nn
import torch.optim as optim
import pickle
import json
import logging
import os
from datetime import datetime

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('dimensionality_reduction.log'),
        logging.StreamHandler()
    ]
)

# Create results directory
os.makedirs('results', exist_ok=True)
os.makedirs('models', exist_ok=True)
os.makedirs('visualizations', exist_ok=True)

class DimensionalityReductionSuite:
    def __init__(self):
        self.results = {}
        self.models = {}

    def load_and_prepare_data(self):
        logging.info("Loading datasets for dimensionality reduction analysis")

        # Load Iris dataset (low-dimensional example)
        iris = load_iris()
        self.iris_data = iris.data
        self.iris_target = iris.target
        self.iris_target_names = iris.target_names
        self.iris_feature_names = iris.feature_names

        logging.info(f"Iris dataset loaded: {self.iris_data.shape} features, {len(np.unique(self.iris_target))} classes")

        # Load Digits dataset (high-dimensional example)
        digits = load_digits()
        self.digits_data = digits.data
        self.digits_target = digits.target
        self.digits_images = digits.images

        logging.info(f"Digits dataset loaded: {self.digits_data.shape} features, {len(np.unique(self.digits_target))} classes")

        # Standardize the data
        self.scaler_iris = StandardScaler()
        self.iris_scaled = self.scaler_iris.fit_transform(self.iris_data)

        self.scaler_digits = StandardScaler()
        self.digits_scaled = self.scaler_digits.fit_transform(self.digits_data)

        logging.info("Data standardization completed")

    def apply_pca(self, data, dataset_name, n_components=2):
        logging.info(f"Applying PCA to {dataset_name} dataset")

        pca = PCA(n_components=n_components)
        data_pca = pca.fit_transform(data)

        # Calculate explained variance
        explained_variance = pca.explained_variance_ratio_
        cumulative_variance = np.cumsum(explained_variance)

        logging.info(f"PCA completed for {dataset_name}")
        logging.info(f"Explained variance per component: {explained_variance}")
        logging.info(f"Cumulative explained variance: {cumulative_variance}")

        # Store results
        self.results[f'{dataset_name}_pca'] = {
            'transformed_data': data_pca,
            'explained_variance': explained_variance,
            'cumulative_variance': cumulative_variance,
            'components': pca.components_
        }

        self.models[f'{dataset_name}_pca'] = pca

        return data_pca, explained_variance

    def apply_tsne(self, data, dataset_name, n_components=2, perplexity=30):
        logging.info(f"Applying t-SNE to {dataset_name} dataset with perplexity={perplexity}")

        tsne = TSNE(n_components=n_components, perplexity=perplexity, random_state=42)
        data_tsne = tsne.fit_transform(data)

        logging.info(f"t-SNE completed for {dataset_name}")
        logging.info(f"Final KL divergence: {tsne.kl_divergence_}")

        # Store results
        self.results[f'{dataset_name}_tsne'] = {
            'transformed_data': data_tsne,
            'kl_divergence': tsne.kl_divergence_
        }

        return data_tsne

    def apply_umap(self, data, dataset_name, n_components=2, n_neighbors=15):
        logging.info(f"Applying UMAP to {dataset_name} dataset with n_neighbors={n_neighbors}")

        umap_reducer = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors, random_state=42)
        data_umap = umap_reducer.fit_transform(data)

        logging.info(f"UMAP completed for {dataset_name}")

        # Store results
        self.results[f'{dataset_name}_umap'] = {
            'transformed_data': data_umap
        }

        self.models[f'{dataset_name}_umap'] = umap_reducer

        return data_umap

class SimpleAutoencoder(nn.Module):
    def __init__(self, input_dim, encoding_dim):
        super(SimpleAutoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, encoding_dim)
        )

        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded, encoded

def train_autoencoder(data, dataset_name, encoding_dim=10, epochs=100, lr=0.001):
    logging.info(f"Training autoencoder for {dataset_name} dataset")
    logging.info(f"Input dimension: {data.shape[1]}, Encoding dimension: {encoding_dim}")

    # Convert to PyTorch tensors
    data_tensor = torch.FloatTensor(data)

    # Initialize model
    model = SimpleAutoencoder(data.shape[1], encoding_dim)
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    # Training loop (full batch: the whole dataset in every step)
    losses = []
    for epoch in range(epochs):
        optimizer.zero_grad()
        reconstructed, encoded = model(data_tensor)
        loss = criterion(reconstructed, data_tensor)
        loss.backward()
        optimizer.step()

        losses.append(loss.item())

        if (epoch + 1) % 20 == 0:
            logging.info(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item():.6f}")

    # Get final encodings
    with torch.no_grad():
        _, final_encoded = model(data_tensor)
        final_encoded = final_encoded.numpy()

    logging.info(f"Autoencoder training completed for {dataset_name}")
    logging.info(f"Final reconstruction loss: {losses[-1]:.6f}")

    return final_encoded, model, losses

def evaluate_dimensionality_reduction(original_data, reduced_data, target, dataset_name, method_name):
    logging.info(f"Evaluating {method_name} performance on {dataset_name} dataset")

    # Note: reduced_data comes from a reducer fitted on the full dataset (test
    # rows included), so the reduced-data accuracy is an optimistic estimate.

    # Split data for classification test; the same random_state and stratify
    # arguments make both splits select the same rows
    X_train_orig, X_test_orig, y_train, y_test = train_test_split(
        original_data, target, test_size=0.3, random_state=42, stratify=target
    )

    X_train_red, X_test_red, _, _ = train_test_split(
        reduced_data, target, test_size=0.3, random_state=42, stratify=target
    )

    # Train classifiers
    rf_orig = RandomForestClassifier(random_state=42)
    rf_red = RandomForestClassifier(random_state=42)

    rf_orig.fit(X_train_orig, y_train)
    rf_red.fit(X_train_red, y_train)

    # Evaluate
    acc_orig = accuracy_score(y_test, rf_orig.predict(X_test_orig))
    acc_red = accuracy_score(y_test, rf_red.predict(X_test_red))

    logging.info(f"Original data accuracy: {acc_orig:.4f}")
    logging.info(f"Reduced data accuracy: {acc_red:.4f}")
    logging.info(f"Accuracy retention: {(acc_red/acc_orig)*100:.2f}%")

    return {
        'original_accuracy': acc_orig,
        'reduced_accuracy': acc_red,
        'accuracy_retention': (acc_red/acc_orig)*100
    }

def create_visualizations(dr_suite):
    logging.info("Creating comprehensive visualizations")

    # 1. PCA Explained Variance Plot
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    iris_pca_var = dr_suite.results['iris_pca']['explained_variance']
    plt.bar(range(1, len(iris_pca_var)+1), iris_pca_var)
    plt.title('Iris Dataset - PCA Explained Variance')
    plt.xlabel('Principal Component')
    plt.ylabel('Explained Variance Ratio')

    plt.subplot(1, 2, 2)
    digits_pca_var = dr_suite.results['digits_pca']['explained_variance']
    plt.bar(range(1, len(digits_pca_var)+1), digits_pca_var)
    plt.title('Digits Dataset - PCA Explained Variance')
    plt.xlabel('Principal Component')
    plt.ylabel('Explained Variance Ratio')

    plt.tight_layout()
    plt.savefig('visualizations/pca_explained_variance.png', dpi=300, bbox_inches='tight')
    plt.close()

    # 2. Comparison of methods on Iris dataset
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # Original data (first 2 features)
    axes[0, 0].scatter(dr_suite.iris_data[:, 0], dr_suite.iris_data[:, 1],
                       c=dr_suite.iris_target, cmap='viridis', alpha=0.7)
    axes[0, 0].set_title('Original Data (First 2 Features)')
    axes[0, 0].set_xlabel('Sepal Length')
    axes[0, 0].set_ylabel('Sepal Width')

    # PCA
    pca_data = dr_suite.results['iris_pca']['transformed_data']
    axes[0, 1].scatter(pca_data[:, 0], pca_data[:, 1],
                       c=dr_suite.iris_target, cmap='viridis', alpha=0.7)
    axes[0, 1].set_title('PCA Reduction')
    axes[0, 1].set_xlabel('PC1')
    axes[0, 1].set_ylabel('PC2')

    # t-SNE
    tsne_data = dr_suite.results['iris_tsne']['transformed_data']
    axes[1, 0].scatter(tsne_data[:, 0], tsne_data[:, 1],
                       c=dr_suite.iris_target, cmap='viridis', alpha=0.7)
    axes[1, 0].set_title('t-SNE Reduction')
    axes[1, 0].set_xlabel('t-SNE 1')
    axes[1, 0].set_ylabel('t-SNE 2')

    # UMAP
    umap_data = dr_suite.results['iris_umap']['transformed_data']
    axes[1, 1].scatter(umap_data[:, 0], umap_data[:, 1],
                       c=dr_suite.iris_target, cmap='viridis', alpha=0.7)
    axes[1, 1].set_title('UMAP Reduction')
    axes[1, 1].set_xlabel('UMAP 1')
    axes[1, 1].set_ylabel('UMAP 2')

    plt.tight_layout()
    plt.savefig('visualizations/iris_comparison.png', dpi=300, bbox_inches='tight')
    plt.close()

    # 3. Digits dataset visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # Original digits: show the first image as a sample
    axes[0, 0].imshow(dr_suite.digits_images[0], cmap='gray')
    axes[0, 0].set_title('Original Digit Image (8x8 pixels)')

    # PCA
    pca_data = dr_suite.results['digits_pca']['transformed_data']
    scatter = axes[0, 1].scatter(pca_data[:, 0], pca_data[:, 1],
                                 c=dr_suite.digits_target, cmap='tab10', alpha=0.7)
    axes[0, 1].set_title('PCA - Digits Dataset')
    axes[0, 1].set_xlabel('PC1')
    axes[0, 1].set_ylabel('PC2')

    # t-SNE
    tsne_data = dr_suite.results['digits_tsne']['transformed_data']
    axes[1, 0].scatter(tsne_data[:, 0], tsne_data[:, 1],
                       c=dr_suite.digits_target, cmap='tab10', alpha=0.7)
    axes[1, 0].set_title('t-SNE - Digits Dataset')
    axes[1, 0].set_xlabel('t-SNE 1')
    axes[1, 0].set_ylabel('t-SNE 2')

    # UMAP
    umap_data = dr_suite.results['digits_umap']['transformed_data']
    axes[1, 1].scatter(umap_data[:, 0], umap_data[:, 1],
                       c=dr_suite.digits_target, cmap='tab10', alpha=0.7)
    axes[1, 1].set_title('UMAP - Digits Dataset')
    axes[1, 1].set_xlabel('UMAP 1')
    axes[1, 1].set_ylabel('UMAP 2')

    plt.tight_layout()
    plt.savefig('visualizations/digits_comparison.png', dpi=300, bbox_inches='tight')
    plt.close()

    logging.info("All visualizations saved to visualizations/ directory")

def main():
    logging.info("Starting Dimensionality Reduction Suite")

    # Initialize the suite
    dr_suite = DimensionalityReductionSuite()

    # Load and prepare data
    dr_suite.load_and_prepare_data()

    # Apply PCA
    logging.info("=== APPLYING PCA ===")
    dr_suite.apply_pca(dr_suite.iris_scaled, 'iris', n_components=2)
    dr_suite.apply_pca(dr_suite.digits_scaled, 'digits', n_components=2)

    # Apply t-SNE
    logging.info("=== APPLYING t-SNE ===")
    dr_suite.apply_tsne(dr_suite.iris_scaled, 'iris', perplexity=30)
    dr_suite.apply_tsne(dr_suite.digits_scaled, 'digits', perplexity=30)

    # Apply UMAP
    logging.info("=== APPLYING UMAP ===")
    dr_suite.apply_umap(dr_suite.iris_scaled, 'iris', n_neighbors=15)
    dr_suite.apply_umap(dr_suite.digits_scaled, 'digits', n_neighbors=15)

    # Apply Autoencoder
    logging.info("=== APPLYING AUTOENCODER ===")
    iris_encoded, iris_autoencoder, iris_losses = train_autoencoder(
        dr_suite.iris_scaled, 'iris', encoding_dim=2, epochs=50, lr=0.001
    )

    digits_encoded, digits_autoencoder, digits_losses = train_autoencoder(
        dr_suite.digits_scaled, 'digits', encoding_dim=10, epochs=100, lr=0.001
    )

    # Store autoencoder results
    dr_suite.results['iris_autoencoder'] = {
        'transformed_data': iris_encoded,
        'training_losses': iris_losses
    }

    dr_suite.results['digits_autoencoder'] = {
        'transformed_data': digits_encoded,
        'training_losses': digits_losses
    }

    # Evaluate all methods
    logging.info("=== EVALUATING METHODS ===")
    evaluation_results = {}

    # Evaluate on Iris dataset
    methods = ['pca', 'tsne', 'umap']
    for method in methods:
        eval_result = evaluate_dimensionality_reduction(
            dr_suite.iris_scaled,
            dr_suite.results[f'iris_{method}']['transformed_data'],
            dr_suite.iris_target,
            'iris',
            method.upper()
        )
        evaluation_results[f'iris_{method}'] = eval_result

    # Evaluate on Digits dataset
    for method in methods:
        eval_result = evaluate_dimensionality_reduction(
            dr_suite.digits_scaled,
            dr_suite.results[f'digits_{method}']['transformed_data'],
            dr_suite.digits_target,
            'digits',
            method.upper()
        )
        evaluation_results[f'digits_{method}'] = eval_result

    # Create visualizations
    create_visualizations(dr_suite)

    # Save models
    logging.info("Saving trained models")
    with open('models/pca_iris.pkl', 'wb') as f:
        pickle.dump(dr_suite.models['iris_pca'], f)

    with open('models/pca_digits.pkl', 'wb') as f:
        pickle.dump(dr_suite.models['digits_pca'], f)

    with open('models/umap_iris.pkl', 'wb') as f:
        pickle.dump(dr_suite.models['iris_umap'], f)

    with open('models/umap_digits.pkl', 'wb') as f:
        pickle.dump(dr_suite.models['digits_umap'], f)

    torch.save(iris_autoencoder.state_dict(), 'models/autoencoder_iris.pth')
    torch.save(digits_autoencoder.state_dict(), 'models/autoencoder_digits.pth')

    # Save results summary
    logging.info("Saving results summary")
    results_summary = {
        'timestamp': datetime.now().isoformat(),
        'datasets': {
            'iris': {
                'original_features': dr_suite.iris_data.shape[1],
                'samples': dr_suite.iris_data.shape[0],
                'classes': len(np.unique(dr_suite.iris_target))
            },
            'digits': {
                'original_features': dr_suite.digits_data.shape[1],
                'samples': dr_suite.digits_data.shape[0],
                'classes': len(np.unique(dr_suite.digits_target))
            }
        },
        'pca_explained_variance': {
            'iris': dr_suite.results['iris_pca']['explained_variance'].tolist(),
            'digits': dr_suite.results['digits_pca']['explained_variance'].tolist()
        },
        'evaluation_results': evaluation_results,
        'autoencoder_final_losses': {
            'iris': iris_losses[-1],
            'digits': digits_losses[-1]
        }
    }

    with open('results/dimensionality_reduction_summary.json', 'w') as f:
        json.dump(results_summary, f, indent=2)

    # Print final summary
    logging.info("=== FINAL SUMMARY ===")
    logging.info(f"Iris Dataset - PCA Explained Variance: {dr_suite.results['iris_pca']['explained_variance']}")
    logging.info(f"Digits Dataset - PCA Explained Variance: {dr_suite.results['digits_pca']['explained_variance']}")

    for dataset in ['iris', 'digits']:
        logging.info(f"\n{dataset.upper()} Dataset Classification Performance:")
        for method in ['pca', 'tsne', 'umap']:
            result = evaluation_results[f'{dataset}_{method}']
            logging.info(f"  {method.upper()}: {result['accuracy_retention']:.2f}% accuracy retention")

    logging.info("\nAll models saved to models/ directory")
    logging.info("All results saved to results/ directory")
    logging.info("All visualizations saved to visualizations/ directory")
    logging.info("Dimensionality Reduction Suite completed successfully!")

if __name__ == "__main__":
    main()
 

2025-07-16 10:36:41,644 - INFO - Starting Dimensionality Reduction Suite
2025-07-16 10:36:41,645 - INFO - Loading datasets for dimensionality reduction analysis
2025-07-16 10:36:41,647 - INFO - Iris dataset loaded: (150, 4) features, 3 classes
2025-07-16 10:36:41,656 - INFO - Digits dataset loaded: (1797, 64) features, 10 classes
2025-07-16 10:36:41,658 - INFO - Data standardization completed
2025-07-16 10:36:41,658 - INFO - === APPLYING PCA ===
2025-07-16 10:36:41,658 - INFO - Applying PCA to iris dataset
2025-07-16 10:36:41,661 - INFO - PCA completed for iris
2025-07-16 10:36:41,661 - INFO - Explained variance per component: [0.72962445 0.22850762]
2025-07-16 10:36:41,661 - INFO - Cumulative explained variance: [0.72962445 0.95813207]
2025-07-16 10:36:41,662 - INFO - Applying PCA to digits dataset
2025-07-16 10:36:41,670 - INFO

In [None]:
# Import all necessary libraries for dimensionality reduction analysis
import numpy as np # Numerical computing foundation
import pandas as pd # Data manipulation (though we use sklearn datasets directly)
import matplotlib.pyplot as plt # Plotting library for static visualizations
import seaborn as sns # Statistical plotting enhancements
import plotly.express as px # Interactive plotting (not used but available)
import plotly.graph_objects as go # More complex interactive plots
from sklearn.datasets import load_iris, load_digits # Standard ML datasets
from sklearn.preprocessing import StandardScaler # Feature scaling (critical for DR)
from sklearn.decomposition import PCA # Principal Component Analysis
from sklearn.manifold import TSNE # t-Distributed Stochastic Neighbor Embedding
from sklearn.model_selection import train_test_split # Data splitting for evaluation
from sklearn.ensemble import RandomForestClassifier # Robust classifier for evaluation
from sklearn.metrics import accuracy_score, classification_report # Performance metrics
import umap # Uniform Manifold Approximation and Projection
import torch # PyTorch for neural network autoencoder
import torch.nn as nn # Neural network modules
import torch.optim as optim # Optimization algorithms
import pickle # Model serialization for sklearn models
import json # Results storage in human-readable format
import logging # Comprehensive logging instead of print statements
import os # Directory and file operations
from datetime import datetime # Timestamps for results

# Configure logging to both file and console
# This replaces print statements and provides timestamps and log levels
logging.basicConfig(
    level=logging.INFO,  # Show INFO level and above
    format='%(asctime)s - %(levelname)s - %(message)s',  # Include timestamp
    handlers=[
        logging.FileHandler('dimensionality_reduction.log'),  # Save to file
        logging.StreamHandler()  # Also display in console
    ]
)

# Create directories for organized output storage
# exist_ok=True prevents errors if directories already exist
os.makedirs('results', exist_ok=True) # Numerical results and summaries
os.makedirs('models', exist_ok=True) # Trained models for reuse
os.makedirs('visualizations', exist_ok=True) # Generated plots

class DimensionalityReductionSuite:
    """
    Main class to organize all dimensionality reduction experiments

    Design Choice: Using a class to maintain state and organize methods
    - Keeps related data and methods together
    - Allows easy access to results across different methods
    - Facilitates comparison and evaluation
    """

    def __init__(self):
        """Initialize storage for results and trained models"""
        self.results = {}  # Store transformed data and metrics
        self.models = {}  # Store trained models for reuse
 
    def load_and_prepare_data(self):
        """
        Load standard datasets and prepare them for dimensionality reduction

        Dataset Choice Rationale:
        - Iris: Low-dimensional (4 features), well-separated classes, good for understanding
        - Digits: High-dimensional (64 features), more challenging, realistic scenario
        """
        logging.info("Loading datasets for dimensionality reduction analysis")

        # Load Iris dataset - classic 4D dataset with 3 flower species
        iris = load_iris()
        self.iris_data = iris.data  # 150 samples × 4 features
        self.iris_target = iris.target  # Class labels (0, 1, 2)
        self.iris_target_names = iris.target_names  # ['setosa', 'versicolor', 'virginica']
        self.iris_feature_names = iris.feature_names  # Sepal/petal length/width

        logging.info(f"Iris dataset loaded: {self.iris_data.shape} features, {len(np.unique(self.iris_target))} classes")

        # Load Digits dataset - 8×8 pixel images of handwritten digits (0-9)
        digits = load_digits()
        self.digits_data = digits.data  # 1797 samples × 64 features (flattened 8×8 images)
        self.digits_target = digits.target  # Digit labels (0-9)
        self.digits_images = digits.images  # Original 8×8 image format for visualization

        logging.info(f"Digits dataset loaded: {self.digits_data.shape} features, {len(np.unique(self.digits_target))} classes")

        # CRITICAL: Standardize the data before applying dimensionality reduction
        # Why standardization is essential:
        # 1. Features have different scales (e.g., sepal length vs width)
        # 2. PCA is sensitive to feature scales - larger values dominate
        # 3. Distance-based methods (t-SNE, UMAP) need comparable scales
        # 4. Neural networks train better with normalized inputs

        self.scaler_iris = StandardScaler()  # Create scaler for iris data
        # fit_transform: (1) calculates mean and std, (2) applies transformation
        self.iris_scaled = self.scaler_iris.fit_transform(self.iris_data)

        self.scaler_digits = StandardScaler()  # Separate scaler for digits
        self.digits_scaled = self.scaler_digits.fit_transform(self.digits_data)

        logging.info("Data standardization completed")
 
    def apply_pca(self, data, dataset_name, n_components=2):
        """
        Apply Principal Component Analysis

        PCA finds linear combinations of original features that explain maximum variance

        Parameters:
        - data: Standardized input data
        - dataset_name: For organizing results
        - n_components: Number of dimensions to reduce to (2 for visualization)

        Design Choice: Using 2 components for easy visualization and comparison
        """
        logging.info(f"Applying PCA to {dataset_name} dataset")

        # Create PCA object with specified number of components
        pca = PCA(n_components=n_components)

        # fit_transform: (1) finds principal components, (2) projects data
        data_pca = pca.fit_transform(data)

        # Extract variance information - crucial for understanding quality
        explained_variance = pca.explained_variance_ratio_  # Proportion of variance per component
        cumulative_variance = np.cumsum(explained_variance)  # Running total of explained variance

        logging.info(f"PCA completed for {dataset_name}")
        logging.info(f"Explained variance per component: {explained_variance}")
        logging.info(f"Cumulative explained variance: {cumulative_variance}")

        # Store comprehensive results for later analysis
        self.results[f'{dataset_name}_pca'] = {
            'transformed_data': data_pca,  # Projected data points
            'explained_variance': explained_variance,  # How much variance each PC explains
            'cumulative_variance': cumulative_variance,  # Total variance captured
            'components': pca.components_  # The actual principal components (directions)
        }

        # Store trained model for potential reuse (e.g., transforming new data)
        self.models[f'{dataset_name}_pca'] = pca

        return data_pca, explained_variance
 
    def apply_tsne(self, data, dataset_name, n_components=2, perplexity=30):
        """
        Apply t-Distributed Stochastic Neighbor Embedding

        t-SNE preserves local neighborhood structure, excellent for visualization

        Key Parameters:
        - perplexity: Balance between local and global structure (typically 5-50)
        - n_components: Output dimensions (2 or 3 for visualization)

        Important: t-SNE is non-linear and stochastic; results vary between
        runs unless random_state is fixed
        """
        logging.info(f"Applying t-SNE to {dataset_name} dataset with perplexity={perplexity}")

        # Create t-SNE object with careful parameter selection
        # random_state=42: Fixes the random initialization for reproducible results
        # perplexity=30: scikit-learn's default; must be smaller than n_samples
        tsne = TSNE(n_components=n_components, perplexity=perplexity, random_state=42)

        # fit_transform: scikit-learn's TSNE has no transform() for new data
        # It optimizes the embedding directly from the data it is given
        data_tsne = tsne.fit_transform(data)

        logging.info(f"t-SNE completed for {dataset_name}")
        # KL divergence: Lower values indicate a better fit
        # (only comparable between runs that use the same perplexity)
        logging.info(f"Final KL divergence: {tsne.kl_divergence_}")

        # Store results (note: no reusable model for t-SNE)
        self.results[f'{dataset_name}_tsne'] = {
            'transformed_data': data_tsne,
            'kl_divergence': tsne.kl_divergence_  # Quality metric
        }

        return data_tsne
 
    def apply_umap(self, data, dataset_name, n_components=2, n_neighbors=15):
        """
        Apply Uniform Manifold Approximation and Projection

        UMAP often preserves more global structure than t-SNE while still
        keeping local neighborhoods intact

        Key Parameters:
        - n_neighbors: Size of local neighborhood (typically 5-50)
        - n_components: Output dimensions

        Advantage: A fitted UMAP model can transform new data (unlike t-SNE)
        """
        logging.info(f"Applying UMAP to {dataset_name} dataset with n_neighbors={n_neighbors}")

        # Create UMAP reducer with balanced parameters
        # n_neighbors=15: Good balance between local and global structure
        # random_state=42: Reproducible results
        umap_reducer = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors, random_state=42)

        # fit_transform: UMAP learns mapping and applies it
        data_umap = umap_reducer.fit_transform(data)

        logging.info(f"UMAP completed for {dataset_name}")

        # Store results and model (UMAP can transform new data)
        self.results[f'{dataset_name}_umap'] = {
            'transformed_data': data_umap
        }

        # Save model for potential reuse
        self.models[f'{dataset_name}_umap'] = umap_reducer

        return data_umap
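

# ---------------------------------------------------------------------------
# Illustrative sketch (an addition, not part of the original suite).
# apply_pca records the cumulative explained variance but always keeps the
# n_components it was given. When the goal is compression rather than a 2D
# plot, one common approach is to keep the fewest components whose cumulative
# variance reaches a target. `choose_n_components` and the 0.95 default are
# hypothetical names/values chosen for this example.
# ---------------------------------------------------------------------------
import numpy as np
from sklearn.decomposition import PCA


def choose_n_components(data, target_variance=0.95):
    """Return the smallest number of PCs whose cumulative variance >= target."""
    if not 0.0 < target_variance <= 1.0:
        raise ValueError("target_variance must be in (0, 1]")
    # Fit with all components so the full variance spectrum is available
    full_pca = PCA().fit(data)
    cumulative = np.cumsum(full_pca.explained_variance_ratio_)
    # Index of the first running total that reaches the target, plus one
    n_keep = int(np.searchsorted(cumulative, target_variance)) + 1
    # Guard against floating-point round-off when target_variance is 1.0
    return min(n_keep, len(cumulative))


# Possible use with the suite's scaled digits array (name as used in main()):
#   n_keep = choose_n_components(dr_suite.digits_scaled, target_variance=0.95)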

class SimpleAutoencoder(nn.Module):
    """
    Neural network autoencoder for dimensionality reduction

    Architecture Design Rationale:
    - Encoder: Progressively reduces dimensions (input → 128 → 64 → encoding_dim)
    - Decoder: Mirrors encoder in reverse (encoding_dim → 64 → 128 → input)
    - ReLU activations: Introduce non-linearity while mitigating vanishing gradients
    - No activation on final layer: Allows reconstruction of any real values

    Design Choice: Simple but effective architecture
    - Avoids overly complex models that might not converge
    - Sufficient capacity for the datasets used
    - Easy to understand and modify
    """

    def __init__(self, input_dim, encoding_dim):
        """
        Initialize autoencoder layers

        Parameters:
        - input_dim: Original feature count (4 for iris, 64 for digits)
        - encoding_dim: Compressed representation size
        """
        super().__init__()

        # Encoder: Compress input to lower dimensional representation
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),  # First compression layer
            nn.ReLU(),  # Non-linear activation
            nn.Linear(128, 64),  # Second compression layer
            nn.ReLU(),  # Non-linear activation
            nn.Linear(64, encoding_dim)  # Final encoding layer (no activation)
        )

        # Decoder: Reconstruct original input from encoding
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 64),  # Start expanding
            nn.ReLU(),  # Non-linear activation
            nn.Linear(64, 128),  # Continue expanding
            nn.ReLU(),  # Non-linear activation
            nn.Linear(128, input_dim)  # Final reconstruction (no activation)
        )

    def forward(self, x):
        """
        Forward pass through autoencoder

        Returns both decoded output and encoded representation
        This allows us to use the encoded representation for dimensionality reduction
        """
        encoded = self.encoder(x)  # Compress input
        decoded = self.decoder(encoded)  # Reconstruct from compression
        return decoded, encoded
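

# ---------------------------------------------------------------------------
# Illustrative sketch (an addition, not part of the original suite).
# The class docstring calls the architecture "sufficient capacity for the
# datasets used". Counting trainable parameters makes that concrete: each
# nn.Linear(in, out) layer holds in*out weights plus out biases.
# `autoencoder_param_count` is a hypothetical helper; its default hidden
# sizes (128, 64) mirror SimpleAutoencoder.
# ---------------------------------------------------------------------------
def autoencoder_param_count(input_dim, encoding_dim, hidden=(128, 64)):
    """Count weights and biases for the mirrored encoder/decoder stack."""
    encoder_sizes = [input_dim, *hidden, encoding_dim]
    decoder_sizes = encoder_sizes[::-1]

    def linear_stack_params(sizes):
        # Sum (in * out + out) over consecutive layer pairs
        return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

    return linear_stack_params(encoder_sizes) + linear_stack_params(decoder_sizes)


# Iris (4 features, 2-D code):      autoencoder_param_count(4, 2)   -> 18054
# Digits (64 features, 10-D code):  autoencoder_param_count(64, 10) -> 34506
# Iris has only 150 samples, so the network has far more parameters than
# data points; a low training loss alone does not show that it generalizes.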

def train_autoencoder(data, dataset_name, encoding_dim=10, epochs=100, lr=0.001):
    """
    Train autoencoder for dimensionality reduction

    Training Process:
    1. Convert data to PyTorch tensors
    2. Initialize model, loss function, and optimizer
    3. Training loop: forward pass → loss calculation → backpropagation
    4. Extract final encoded representations

    Hyperparameter Choices:
    - epochs=100: Usually enough on small datasets; each epoch is one
      full-batch update, so check the logged loss for convergence
    - lr=0.001: Adam's default learning rate, a conservative starting point
    - Adam optimizer: Adaptive learning rate, good default choice
    - MSE loss: Appropriate for reconstructing real-valued features
    """
    logging.info(f"Training autoencoder for {dataset_name} dataset")
    logging.info(f"Input dimension: {data.shape[1]}, Encoding dimension: {encoding_dim}")

    # Convert numpy array to a float32 PyTorch tensor
    # float32 matches the default dtype of nn.Linear weights
    data_tensor = torch.as_tensor(data, dtype=torch.float32)

    # Initialize model with appropriate dimensions
    model = SimpleAutoencoder(data.shape[1], encoding_dim)

    # Loss function: Mean Squared Error for reconstruction
    # Measures average squared difference between input and reconstruction
    criterion = nn.MSELoss()

    # Optimizer: Adam with learning rate
    # Adam adapts learning rate per parameter, generally robust
    optimizer = optim.Adam(model.parameters(), lr=lr)

    # Track training progress
    losses = []

    # Training loop (full batch: the whole dataset is used in every step)
    for epoch in range(epochs):
        # Reset gradients (PyTorch accumulates gradients by default)
        optimizer.zero_grad()

        # Forward pass: get reconstruction (the encoding is not needed here)
        reconstructed, _ = model(data_tensor)

        # Calculate reconstruction loss
        # Goal: minimize difference between input and reconstruction
        loss = criterion(reconstructed, data_tensor)

        # Backward pass: calculate gradients
        loss.backward()

        # Update model parameters
        optimizer.step()

        # Store loss for monitoring
        losses.append(loss.item())

        # Periodic logging to monitor training progress
        if (epoch + 1) % 20 == 0:
            logging.info(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item():.6f}")

    # Extract final encoded representations for dimensionality reduction
    with torch.no_grad():  # Disable gradient computation for inference
        _, final_encoded = model(data_tensor)
        final_encoded = final_encoded.numpy()  # Convert back to numpy

    logging.info(f"Autoencoder training completed for {dataset_name}")
    logging.info(f"Final reconstruction loss: {losses[-1]:.6f}")

    return final_encoded, model, losses
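

# ---------------------------------------------------------------------------
# Illustrative sketch (an addition, not part of the original suite).
# The final MSE logged by train_autoencoder is hard to judge in isolation.
# A simple reference point is the reconstruction error of PCA with the same
# number of dimensions: PCA gives the best *linear* reconstruction in the
# squared-error sense, so a non-linear autoencoder with enough capacity can
# in principle match or beat it on the training data.
# `pca_reconstruction_mse` is a hypothetical helper name.
# ---------------------------------------------------------------------------
import numpy as np
from sklearn.decomposition import PCA


def pca_reconstruction_mse(data, n_components):
    """Mean squared error after projecting to n_components and back."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(data)
    restored = pca.inverse_transform(reduced)
    # Same averaging as nn.MSELoss(): mean over every element
    return float(np.mean((np.asarray(data) - restored) ** 2))


# Possible comparison for the digits run in main() (encoding_dim=10):
#   baseline = pca_reconstruction_mse(dr_suite.digits_scaled, n_components=10)
#   # compare `baseline` with digits_losses[-1]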

def evaluate_dimensionality_reduction(original_data, reduced_data, target, dataset_name, method_name):
    """
    Evaluate quality of dimensionality reduction using downstream classification

    Evaluation Strategy:
    1. Train classifier on original high-dimensional data
    2. Train classifier on reduced low-dimensional data
    3. Compare classification accuracies
    4. High accuracy retention indicates good dimensionality reduction

    Why This Evaluation Makes Sense:
    - Tests whether important information is preserved
    - Uses realistic downstream task (classification)
    - Provides interpretable metric (accuracy retention percentage)

    Caveat: if the reducer was fitted on all samples (test rows included),
    the reduced-data accuracy is an optimistic estimate
    """
    logging.info(f"Evaluating {method_name} performance on {dataset_name} dataset")

    # Split original and reduced data in one call so both use identical rows
    # stratify=target: Ensures balanced class distribution in train/test sets
    # random_state=42: Reproducible splits
    X_train_orig, X_test_orig, X_train_red, X_test_red, y_train, y_test = train_test_split(
        original_data, reduced_data, target,
        test_size=0.3, random_state=42, stratify=target
    )

    # Train Random Forest classifiers
    # Random Forest Choice: Robust, handles different feature types well, good baseline
    rf_orig = RandomForestClassifier(random_state=42)  # For original data
    rf_red = RandomForestClassifier(random_state=42)  # For reduced data

    # Train both classifiers
    rf_orig.fit(X_train_orig, y_train)
    rf_red.fit(X_train_red, y_train)

    # Evaluate performance
    acc_orig = accuracy_score(y_test, rf_orig.predict(X_test_orig))
    acc_red = accuracy_score(y_test, rf_red.predict(X_test_red))

    # Accuracy retention as a percentage (undefined if the baseline scored 0)
    retention = (acc_red / acc_orig) * 100 if acc_orig > 0 else float('nan')

    # Log results with clear interpretation
    logging.info(f"Original data accuracy: {acc_orig:.4f}")
    logging.info(f"Reduced data accuracy: {acc_red:.4f}")
    logging.info(f"Accuracy retention: {retention:.2f}%")

    # Return structured results
    return {
        'original_accuracy': acc_orig,
        'reduced_accuracy': acc_red,
        'accuracy_retention': retention  # Key metric for comparison
    }
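

# ---------------------------------------------------------------------------
# Illustrative sketch (an addition, not part of the original suite).
# evaluate_dimensionality_reduction receives data that was already reduced,
# so the reducer has seen the test rows. One way to avoid that, for reducers
# with a transform() method such as PCA, is to put scaling, reduction and the
# classifier in a Pipeline so that each cross-validation fold fits them on
# its training rows only. `cross_validated_accuracy` and cv=5 are
# hypothetical choices for this example.
# ---------------------------------------------------------------------------
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cross_validated_accuracy(raw_data, target, n_components=2, cv=5):
    """Mean and per-fold accuracy with the reducer fitted inside each fold."""
    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        RandomForestClassifier(random_state=42),
    )
    scores = cross_val_score(pipeline, raw_data, target, cv=cv, scoring='accuracy')
    return scores.mean(), scores


# Possible use with the unscaled iris array (name as used in main()):
#   mean_acc, fold_scores = cross_validated_accuracy(dr_suite.iris_data, dr_suite.iris_target)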

def create_visualizations(dr_suite):
    """
    Generate comprehensive visualizations comparing all methods

    Visualization Strategy:
    1. PCA explained variance plots - understand information retention
    2. Side-by-side method comparisons - visual quality assessment
    3. Dataset-specific plots - accommodate different characteristics

    Design Choices:
    - High DPI (300) for publication quality
    - Consistent color schemes for easy comparison
    - Clear titles and labels for interpretation
    """
    logging.info("Creating comprehensive visualizations")

    # 1. PCA Explained Variance Analysis
    # Shows how much information each principal component captures
    plt.figure(figsize=(12, 5))

    # Iris dataset explained variance
    plt.subplot(1, 2, 1)
    iris_pca_var = dr_suite.results['iris_pca']['explained_variance']
    plt.bar(range(1, len(iris_pca_var)+1), iris_pca_var)
    plt.title('Iris Dataset - PCA Explained Variance')
    plt.xlabel('Principal Component')
    plt.ylabel('Explained Variance Ratio')
    # Add percentage labels on bars for clarity
    for i, v in enumerate(iris_pca_var):
        plt.text(i+1, v + 0.01, f'{v:.1%}', ha='center')

    # Digits dataset explained variance
    plt.subplot(1, 2, 2)
    digits_pca_var = dr_suite.results['digits_pca']['explained_variance']
    plt.bar(range(1, len(digits_pca_var)+1), digits_pca_var)
    plt.title('Digits Dataset - PCA Explained Variance')
    plt.xlabel('Principal Component')
    plt.ylabel('Explained Variance Ratio')
    # Add percentage labels on bars
    for i, v in enumerate(digits_pca_var):
        plt.text(i+1, v + 0.002, f'{v:.1%}', ha='center')

    plt.tight_layout()
    plt.savefig('visualizations/pca_explained_variance.png', dpi=300, bbox_inches='tight')
    plt.close()  # Close figure to free memory

    # 2. Iris Dataset Method Comparison
    # 2×2 grid showing different dimensionality reduction results
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # Original data visualization (using first 2 features)
    axes[0, 0].scatter(dr_suite.iris_data[:, 0], dr_suite.iris_data[:, 1],
                       c=dr_suite.iris_target, cmap='viridis', alpha=0.7)
    axes[0, 0].set_title('Original Data (First 2 Features)')
    axes[0, 0].set_xlabel('Sepal Length')
    axes[0, 0].set_ylabel('Sepal Width')

    # PCA results
    pca_data = dr_suite.results['iris_pca']['transformed_data']
    scatter1 = axes[0, 1].scatter(pca_data[:, 0], pca_data[:, 1],
                                  c=dr_suite.iris_target, cmap='viridis', alpha=0.7)
    axes[0, 1].set_title('PCA Reduction')
    axes[0, 1].set_xlabel('PC1')
    axes[0, 1].set_ylabel('PC2')
    # Add colorbar to show class mapping (all panels share the same colors)
    fig.colorbar(scatter1, ax=axes[0, 1], label='Class index')

    # t-SNE results
    tsne_data = dr_suite.results['iris_tsne']['transformed_data']
    axes[1, 0].scatter(tsne_data[:, 0], tsne_data[:, 1],
                       c=dr_suite.iris_target, cmap='viridis', alpha=0.7)
    axes[1, 0].set_title('t-SNE Reduction')
    axes[1, 0].set_xlabel('t-SNE 1')
    axes[1, 0].set_ylabel('t-SNE 2')

    # UMAP results
    umap_data = dr_suite.results['iris_umap']['transformed_data']
    axes[1, 1].scatter(umap_data[:, 0], umap_data[:, 1],
                       c=dr_suite.iris_target, cmap='viridis', alpha=0.7)
    axes[1, 1].set_title('UMAP Reduction')
    axes[1, 1].set_xlabel('UMAP 1')
    axes[1, 1].set_ylabel('UMAP 2')

    plt.tight_layout()
    plt.savefig('visualizations/iris_comparison.png', dpi=300, bbox_inches='tight')
    plt.close()

    # 3. Digits Dataset Visualization
    # More challenging due to higher dimensionality and more classes
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # Show sample original digit
    axes[0, 0].imshow(dr_suite.digits_images[0], cmap='gray')
    axes[0, 0].set_title('Sample Original Digit (8×8 pixels)')
    axes[0, 0].axis('off')  # Remove axes for cleaner image display

    # PCA results for digits
    pca_data = dr_suite.results['digits_pca']['transformed_data']
    scatter2 = axes[0, 1].scatter(pca_data[:, 0], pca_data[:, 1],
                                  c=dr_suite.digits_target, cmap='tab10', alpha=0.7)
    axes[0, 1].set_title('PCA - Digits Dataset')
    axes[0, 1].set_xlabel('PC1')
    axes[0, 1].set_ylabel('PC2')
    # Add colorbar to show which color belongs to which digit
    fig.colorbar(scatter2, ax=axes[0, 1], label='Digit')

    # t-SNE results for digits
    tsne_data = dr_suite.results['digits_tsne']['transformed_data']
    axes[1, 0].scatter(tsne_data[:, 0], tsne_data[:, 1],
                       c=dr_suite.digits_target, cmap='tab10', alpha=0.7)
    axes[1, 0].set_title('t-SNE - Digits Dataset')
    axes[1, 0].set_xlabel('t-SNE 1')
    axes[1, 0].set_ylabel('t-SNE 2')

    # UMAP results for digits
    umap_data = dr_suite.results['digits_umap']['transformed_data']
    axes[1, 1].scatter(umap_data[:, 0], umap_data[:, 1],
                       c=dr_suite.digits_target, cmap='tab10', alpha=0.7)
    axes[1, 1].set_title('UMAP - Digits Dataset')
    axes[1, 1].set_xlabel('UMAP 1')
    axes[1, 1].set_ylabel('UMAP 2')

    plt.tight_layout()
    plt.savefig('visualizations/digits_comparison.png', dpi=300, bbox_inches='tight')
    plt.close()

    logging.info("All visualizations saved to visualizations/ directory")
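

# ---------------------------------------------------------------------------
# Illustrative sketch (an addition, not part of the original suite).
# The side-by-side scatter plots support only a visual judgement. A numeric
# complement is scikit-learn's trustworthiness score, which lies in [0, 1]
# and measures how far the nearest neighbors found in the embedding are also
# neighbors in the original space. `embedding_trustworthiness` and
# n_neighbors=5 are hypothetical choices for this example.
# ---------------------------------------------------------------------------
from sklearn.manifold import trustworthiness


def embedding_trustworthiness(original_data, embedded_data, n_neighbors=5):
    """Score local-structure preservation of an embedding (1.0 is best)."""
    return float(trustworthiness(original_data, embedded_data, n_neighbors=n_neighbors))


# Possible use with the suite's stored results (names as used in main()):
#   for method in ('pca', 'tsne', 'umap'):
#       score = embedding_trustworthiness(
#           dr_suite.iris_scaled,
#           dr_suite.results[f'iris_{method}']['transformed_data'])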

def main():
    """
    Main execution function that orchestrates the entire analysis

    Execution Flow:
    1. Initialize suite and load data
    2. Apply all dimensionality reduction methods
    3. Evaluate performance using classification
    4. Generate visualizations
    5. Save models and results
    6. Provide comprehensive summary

    Design Choice: Structured workflow ensures reproducibility and completeness
    """
    logging.info("Starting Dimensionality Reduction Suite")

    # Initialize the comprehensive suite
    dr_suite = DimensionalityReductionSuite()

    # Step 1: Data preparation
    dr_suite.load_and_prepare_data()

    # Step 2: Apply linear method (PCA)
    logging.info("=== APPLYING PCA ===")
    # Apply to both datasets with 2 components for comparison
    dr_suite.apply_pca(dr_suite.iris_scaled, 'iris', n_components=2)
    dr_suite.apply_pca(dr_suite.digits_scaled, 'digits', n_components=2)

    # Step 3: Apply non-linear manifold learning (t-SNE)
    logging.info("=== APPLYING t-SNE ===")
    # Use consistent parameters across datasets
    dr_suite.apply_tsne(dr_suite.iris_scaled, 'iris', perplexity=30)
    dr_suite.apply_tsne(dr_suite.digits_scaled, 'digits', perplexity=30)

    # Step 4: Apply modern manifold learning (UMAP)
    logging.info("=== APPLYING UMAP ===")
    # UMAP often provides good balance of local and global structure
    dr_suite.apply_umap(dr_suite.iris_scaled, 'iris', n_neighbors=15)
    dr_suite.apply_umap(dr_suite.digits_scaled, 'digits', n_neighbors=15)

    # Step 5: Apply neural network approach (Autoencoder)
    logging.info("=== APPLYING AUTOENCODER ===")
    # Different encoding dimensions based on dataset complexity
    iris_encoded, iris_autoencoder, iris_losses = train_autoencoder(
        dr_suite.iris_scaled, 'iris', encoding_dim=2, epochs=50, lr=0.001
    )

    digits_encoded, digits_autoencoder, digits_losses = train_autoencoder(
        dr_suite.digits_scaled, 'digits', encoding_dim=10, epochs=100, lr=0.001
    )

    # Store autoencoder results in consistent format
    dr_suite.results['iris_autoencoder'] = {
        'transformed_data': iris_encoded,
        'training_losses': iris_losses
    }

    dr_suite.results['digits_autoencoder'] = {
        'transformed_data': digits_encoded,
        'training_losses': digits_losses
    }

    # Step 6: Comprehensive evaluation
    logging.info("=== EVALUATING METHODS ===")
    evaluation_results = {}

    # Evaluate the 2D embeddings on both datasets
    # (autoencoder encodings are stored above but not scored here)
    methods = ['pca', 'tsne', 'umap']

    # Iris dataset evaluation
    for method in methods:
        eval_result = evaluate_dimensionality_reduction(
            dr_suite.iris_scaled,  # Original standardized data
            dr_suite.results[f'iris_{method}']['transformed_data'],  # Reduced data
            dr_suite.iris_target,  # Class labels for classification
            'iris',  # Dataset name
            method.upper()  # Method name for logging
        )
        evaluation_results[f'iris_{method}'] = eval_result

    # Digits dataset evaluation
    for method in methods:
        eval_result = evaluate_dimensionality_reduction(
            dr_suite.digits_scaled,
            dr_suite.results[f'digits_{method}']['transformed_data'],
            dr_suite.digits_target,
            'digits',
            method.upper()
        )
        evaluation_results[f'digits_{method}'] = eval_result

    # Step 7: Generate comprehensive visualizations
    create_visualizations(dr_suite)

    # Step 8: Save all trained models for future use
    logging.info("Saving trained models")

    # Save the fitted PCA and UMAP models using pickle
    pickled_models = {
        'iris_pca': 'models/pca_iris.pkl',
        'digits_pca': 'models/pca_digits.pkl',
        'iris_umap': 'models/umap_iris.pkl',
        'digits_umap': 'models/umap_digits.pkl',
    }
    for model_key, path in pickled_models.items():
        with open(path, 'wb') as f:
            pickle.dump(dr_suite.models[model_key], f)

    # Save PyTorch models using torch.save (state dictionaries)
    torch.save(iris_autoencoder.state_dict(), 'models/autoencoder_iris.pth')
    torch.save(digits_autoencoder.state_dict(), 'models/autoencoder_digits.pth')

    # Step 9: Create comprehensive results summary
    logging.info("Saving results summary")
    results_summary = {
        'timestamp': datetime.now().isoformat(),  # When analysis was run
        'datasets': {
            'iris': {
                'original_features': dr_suite.iris_data.shape[1],
                'samples': dr_suite.iris_data.shape[0],
                'classes': len(np.unique(dr_suite.iris_target))
            },
            'digits': {
                'original_features': dr_suite.digits_data.shape[1],
                'samples': dr_suite.digits_data.shape[0],
                'classes': len(np.unique(dr_suite.digits_target))
            }
        },
        # PCA explained variance is crucial for understanding information retention
        'pca_explained_variance': {
            'iris': dr_suite.results['iris_pca']['explained_variance'].tolist(),
            'digits': dr_suite.results['digits_pca']['explained_variance'].tolist()
        },
        # Classification performance comparison for PCA, t-SNE, and UMAP
        'evaluation_results': evaluation_results,
        # Autoencoder training convergence metrics
        'autoencoder_final_losses': {
            'iris': iris_losses[-1],  # Final reconstruction loss for iris
            'digits': digits_losses[-1]  # Final reconstruction loss for digits
        }
    }

    # Save as JSON for easy reading and further analysis
    with open('results/dimensionality_reduction_summary.json', 'w') as f:
        json.dump(results_summary, f, indent=2)  # indent=2 for readability

    # Step 10: Log comprehensive summary
    logging.info("=== FINAL SUMMARY ===")

    # PCA explained variance summary
    logging.info(f"Iris Dataset - PCA Explained Variance: {dr_suite.results['iris_pca']['explained_variance']}")
    logging.info(f"Digits Dataset - PCA Explained Variance: {dr_suite.results['digits_pca']['explained_variance']}")

    # Classification performance summary for easy comparison
    for dataset in ['iris', 'digits']:
        logging.info(f"\n{dataset.upper()} Dataset Classification Performance:")
        for method in methods:
            result = evaluation_results[f'{dataset}_{method}']
            logging.info(f" {method.upper()}: {result['accuracy_retention']:.2f}% accuracy retention")

    # Final status messages
    logging.info("\nAll models saved to models/ directory")
    logging.info("All results saved to results/ directory")
    logging.info("All visualizations saved to visualizations/ directory")
    logging.info("Dimensionality Reduction Suite completed successfully!")

# Execute the main function when script is run directly
if __name__ == "__main__":
    main()
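

# ---------------------------------------------------------------------------
# Illustrative sketch (an addition, not part of the original suite).
# main() pickles the fitted PCA and UMAP reducers. A later session could load
# one and project unseen rows with transform(), provided those rows are
# scaled the same way as the training data. t-SNE is absent because
# scikit-learn's TSNE has no transform() method. `load_reducer` and
# `reduce_new_rows` are hypothetical helper names. Only unpickle files from
# a source you trust.
# ---------------------------------------------------------------------------
import pickle


def load_reducer(path):
    """Load a pickled, already fitted reducer from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)


def reduce_new_rows(reducer, new_rows):
    """Project unseen rows with a fitted reducer that supports transform()."""
    if not hasattr(reducer, 'transform'):
        raise TypeError(f"{type(reducer).__name__} cannot embed unseen data")
    return reducer.transform(new_rows)


# Possible use after the script has run (path as written by main()):
#   pca = load_reducer('models/pca_iris.pkl')
#   new_2d = reduce_new_rows(pca, new_scaled_rows)  # new_scaled_rows: hypothetical array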