Codebase Inventory: ml-polymer-recycling

Overview

A comprehensive machine learning system for AI-driven polymer aging prediction and classification using spectral data analysis. The project implements multiple CNN architectures (Figure2CNN, ResNet1D, ResNet18Vision) to classify polymer degradation levels as a proxy for recyclability. It is built in Python on PyTorch and provides both CLI and Streamlit UI workflows.

Inventory by Category

1. Core Application Modules

  • Module Name: models/registry.py

    • Purpose: Central registry system for model architectures, providing dynamic model selection and instantiation (a sketch follows this list)
    • Key Exports/Functions: choices(), build(name, input_length), _REGISTRY
    • Key Dependencies: models.figure2_cnn, models.resnet_cnn, models.resnet18_vision
    • External Dependencies: typing
  • Module Name: models/figure2_cnn.py

    • Purpose: CNN architecture implementation based on literature (Neo et al. 2023) for 1D Raman spectral classification
    • Key Exports/Functions: Figure2CNN class with conv blocks and classifier layers
    • Key Dependencies: None (self-contained)
    • External Dependencies: torch, torch.nn
  • Module Name: models/resnet_cnn.py

    • Purpose: ResNet1D implementation with residual blocks for deeper spectral feature learning
    • Key Exports/Functions: ResNet1D, ResidualBlock1D classes
    • Key Dependencies: None (self-contained)
    • External Dependencies: torch, torch.nn
  • Module Name: models/resnet18_vision.py

    • Purpose: ResNet18 architecture adapted for 1D spectral data processing
    • Key Exports/Functions: ResNet18Vision class
    • Key Dependencies: None (self-contained)
    • External Dependencies: torch, torch.nn
  • Module Name: utils/preprocessing.py

    • Purpose: Spectral data preprocessing utilities covering resampling, baseline correction, smoothing, and normalization (the chain is sketched after this list)
    • Key Exports/Functions: preprocess_spectrum(), resample_spectrum(), remove_baseline(), normalize_spectrum(), smooth_spectrum()
    • Key Dependencies: None (self-contained)
    • External Dependencies: numpy, scipy.interpolate, scipy.signal, sklearn.preprocessing
  • Module Name: scripts/preprocess_dataset.py

    • Purpose: Comprehensive dataset preprocessing pipeline with CLI interface for Raman spectral data
    • Key Exports/Functions: preprocess_dataset(), resample_spectrum(), label_file(), preprocessing helper functions
    • Key Dependencies: scripts.discover_raman_files, scripts.plot_spectrum
    • External Dependencies: numpy, scipy, sklearn.preprocessing
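
Based on the exports listed above, a minimal sketch of how models/registry.py could be organized; the registered names come from this inventory, while the constructor keyword handling and error behavior are assumptions:

```python
# models/registry.py -- illustrative sketch; real constructor kwargs may differ.
from typing import Callable, Dict, List

from models.figure2_cnn import Figure2CNN
from models.resnet_cnn import ResNet1D
from models.resnet18_vision import ResNet18Vision

# Maps CLI-facing names to architecture constructors.
_REGISTRY: Dict[str, Callable] = {
    "figure2": Figure2CNN,
    "resnet": ResNet1D,
    "resnet18vision": ResNet18Vision,
}


def choices() -> List[str]:
    """Registered model names, e.g. for argparse `choices=`."""
    return sorted(_REGISTRY)


def build(name: str, input_length: int):
    """Instantiate a registered architecture for spectra of a given length."""
    try:
        return _REGISTRY[name](input_length=input_length)  # kwarg is assumed
    except KeyError:
        raise ValueError(f"Unknown model '{name}'; options: {choices()}") from None
```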
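
Similarly, a sketch of the preprocessing chain exported by utils/preprocessing.py, assuming common defaults (linear resampling onto a fixed grid, low-order polynomial baseline, Savitzky-Golay smoothing, min-max scaling); the actual target length and parameter values may differ:

```python
# utils/preprocessing.py -- illustrative chain; parameter values are assumptions.
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter


def resample_spectrum(x, y, num_points=500):
    """Interpolate intensities onto a uniform wavenumber grid."""
    grid = np.linspace(x.min(), x.max(), num_points)
    return grid, interp1d(x, y, kind="linear")(grid)


def remove_baseline(y, degree=2):
    """Subtract a low-order polynomial fit as a simple baseline estimate."""
    coords = np.arange(len(y))
    return y - np.polyval(np.polyfit(coords, y, degree), coords)


def smooth_spectrum(y, window=11, polyorder=2):
    """Savitzky-Golay filter to suppress high-frequency noise."""
    return savgol_filter(y, window_length=window, polyorder=polyorder)


def normalize_spectrum(y):
    """Min-max scale intensities to [0, 1]."""
    span = y.max() - y.min()
    return (y - y.min()) / span if span > 0 else y - y.min()


def preprocess_spectrum(x, y, num_points=500):
    """Full chain: resample -> baseline -> smooth -> normalize."""
    grid, resampled = resample_spectrum(x, y, num_points)
    return grid, normalize_spectrum(smooth_spectrum(remove_baseline(resampled)))
```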

2. Scripts & Automation

  • Script Name: validate_pipeline.sh

    • Trigger: Manual execution (./validate_pipeline.sh)
    • Apparent Function: Canonical smoke test validating the complete Raman pipeline from preprocessing through training to inference
    • Dependencies: conda, scripts/preprocess_dataset.py, scripts/train_model.py, scripts/run_inference.py, scripts/plot_spectrum.py
  • Script Name: scripts/train_model.py

    • Trigger: CLI execution (python scripts/train_model.py)
    • Apparent Function: 10-fold stratified cross-validation training with multiple model architectures and preprocessing options
    • Dependencies: scripts/preprocess_dataset, models/registry, reproducibility seeds, PyTorch training loop
  • Script Name: scripts/run_inference.py

    • Trigger: CLI execution (python scripts/run_inference.py)
    • Apparent Function: Single spectrum inference with model loading, preprocessing, and prediction output to JSON
    • Dependencies: models/registry, scripts/preprocess_dataset, trained model weights
  • Script Name: scripts/plot_spectrum.py

    • Trigger: CLI execution (python scripts/plot_spectrum.py)
    • Apparent Function: Visualization tool for Raman spectra with matplotlib plotting and file I/O
    • Dependencies: Spectrum loading utilities
  • Script Name: scripts/discover_raman_files.py

    • Trigger: Imported by other scripts
    • Apparent Function: File discovery and labeling utilities for Raman dataset management (a hypothetical sketch follows this list)
    • Dependencies: File system operations, regex pattern matching
  • Script Name: scripts/list_spectra.py

    • Trigger: CLI or import
    • Apparent Function: Dataset inventory and spectrum listing utilities
    • Dependencies: File system scanning
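
The filename-based labeling in scripts/discover_raman_files.py is described only as regex matching. A purely hypothetical sketch of the idea, using a sta/wea token convention suggested by the sample filenames; the real patterns and label scheme are not documented in this inventory:

```python
# Hypothetical labeling helpers -- the regex tokens and label values are guesses.
import re
from pathlib import Path
from typing import List, Optional, Tuple

_WEATHERED = re.compile(r"wea", re.IGNORECASE)  # assumed token for weathered
_STABLE = re.compile(r"sta", re.IGNORECASE)     # assumed token for stable


def label_file(path: Path) -> Optional[int]:
    """Map a filename to a class label (0 = stable, 1 = weathered)."""
    if _WEATHERED.search(path.stem):
        return 1
    if _STABLE.search(path.stem):
        return 0
    return None  # unlabeled files are skipped by the caller


def discover_raman_files(root: str) -> List[Tuple[Path, int]]:
    """Collect labeled .txt spectra beneath a dataset root such as datasets/rdwp/."""
    pairs = []
    for path in sorted(Path(root).rglob("*.txt")):
        label = label_file(path)
        if label is not None:
            pairs.append((path, label))
    return pairs
```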

3. Configuration & Data

  • File Name: deploy/hf-space/requirements.txt

    • Purpose: Python dependencies for Hugging Face Spaces deployment
    • Key Contents/Structure: streamlit, torch, torchvision, scikit-learn, scipy, numpy, pandas, matplotlib, fastapi, altair, huggingface-hub
  • File Name: deploy/hf-space/Dockerfile

    • Purpose: Container configuration for Hugging Face Spaces deployment
    • Key Contents/Structure: python:3.13-slim base image, build tools installation, Streamlit server configured on port 8501
  • File Name: deploy/hf-space/sample_data/sta-1.txt

    • Purpose: Sample Raman spectrum for UI demonstration
    • Key Contents/Structure: Two-column wavenumber/intensity data format (see the loader sketch after this list)
  • File Name: deploy/hf-space/sample_data/sta-2.txt

    • Purpose: Additional sample Raman spectrum for UI testing
    • Key Contents/Structure: Two-column wavenumber/intensity data format
  • File Name: .gitignore

    • Purpose: Version control exclusions for datasets, build artifacts, and system files
    • Key Contents/Structure: datasets/, __pycache__/, model weights, logs, environment files, deprecated scripts
  • File Name: MANIFEST.git

    • Purpose: Git object manifest listing all tracked files with hashes
    • Key Contents/Structure: File paths, permissions, and SHA hashes for repository contents
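
The two sample spectra follow the two-column layout that the CLI pipeline also expects. A minimal reader for that format, assuming whitespace-delimited numeric columns with no header row:

```python
# Minimal two-column (wavenumber, intensity) reader; delimiter is assumed.
import numpy as np


def load_spectrum(path: str):
    """Return (wavenumbers, intensities) from a two-column text file."""
    data = np.loadtxt(path)  # whitespace-delimited by default
    if data.ndim != 2 or data.shape[1] < 2:
        raise ValueError(f"{path}: expected two columns, got shape {data.shape}")
    return data[:, 0], data[:, 1]


x, y = load_spectrum("deploy/hf-space/sample_data/sta-1.txt")
```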

4. Assets & Documentation

  • Asset Name: README.md

    • Purpose: Primary project documentation with objectives, architecture overview, and usage instructions
    • Key Contents/Structure: Project goals, model architectures table, structure diagram, installation guides, sample commands
  • Asset Name: GROUND_TRUTH_PIPELINE.md

    • Purpose: Comprehensive empirical baseline inventory documenting every aspect of the current system
    • Key Contents/Structure: 635-line detailed documentation of data handling, preprocessing, models, CLI workflow, UI workflow, and gap identification
  • Asset Name: docs/ENVIRONMENT_GUIDE.md

    • Purpose: Environment management guide for local and HPC deployment
    • Key Contents/Structure: Conda vs venv setup instructions, platform-specific configurations, dependency management
  • Asset Name: docs/PROJECT_TIMELINE.md

    • Purpose: Development milestone tracking and project progression documentation
    • Key Contents/Structure: Phase-based timeline from project kickoff through model expansion, tagged milestones
  • Asset Name: docs/sprint_log.md

    • Purpose: Sprint-based development log with specific technical changes and testing results
    • Key Contents/Structure: Chronological entries with goals, changes, tests, and notes for each development sprint
  • Asset Name: docs/REPRODUCIBILITY.md

    • Purpose: Scientific reproducibility guidelines and artifact control documentation
    • Key Contents/Structure: Validation procedures, artifact integrity, experimental controls
  • Asset Name: docs/HPC_REMOTE_SETUP.md

    • Purpose: High-performance computing environment setup for CWRU Pioneer cluster
    • Key Contents/Structure: HPC-specific configurations, remote access procedures, computational resource management
  • Asset Name: docs/BACKEND_MIGRATION_LOG.md

    • Purpose: Technical migration documentation for backend architecture changes
    • Key Contents/Structure: Migration procedures, compatibility notes, system architecture evolution

5. Deployment & UI Components

  • Module Name: deploy/hf-space/app.py
    • Purpose: Streamlit web application for polymer classification with file upload and model inference
    • Key Exports/Functions: Streamlit UI components, model loading, preprocessing pipeline, prediction display
    • Key Dependencies: models.figure2_cnn, models.resnet_cnn, utils.preprocessing (fallback), scripts.preprocess_dataset
    • External Dependencies: streamlit, torch, matplotlib, PIL, numpy
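
A condensed sketch of the upload-and-predict flow in this app; the page title, default weights path, and fixed input length are assumptions, and the real app adds sample-data selection and richer result display:

```python
# Condensed Streamlit flow -- illustrative; paths and labels are assumptions.
import numpy as np
import streamlit as st
import torch

from models import registry


@st.cache_resource
def load_model(name="resnet", weights="outputs/resnet_model.pth", length=500):
    model = registry.build(name, input_length=length)  # length is assumed
    model.load_state_dict(torch.load(weights, map_location="cpu"))
    model.eval()
    return model


st.title("Polymer Aging Classifier")
uploaded = st.file_uploader("Upload a two-column Raman spectrum (.txt)")
if uploaded is not None:
    data = np.loadtxt(uploaded)
    # NOTE: the full CLI chain (resample, baseline, smooth, normalize) should
    # run here; its absence is the preprocessing gap flagged below.
    tensor = torch.tensor(data[:, 1], dtype=torch.float32).view(1, 1, -1)
    with torch.no_grad():
        logits = load_model()(tensor)
    st.write(f"Predicted class index: {int(logits.argmax(dim=1))}")
```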

6. Model Artifacts & Outputs

  • File Name: outputs/resnet_model.pth
    • Purpose: Trained ResNet1D model weights for Raman spectrum classification
    • Key Contents/Structure: PyTorch state dictionary with model parameters
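
Because the artifact is a plain state dictionary, it can be inspected or reloaded without the training script, given a matching architecture. A short sketch; the input length is an assumption:

```python
# Inspect and reload saved ResNet1D weights (sketch; input length is assumed).
import torch

from models import registry

state = torch.load("outputs/resnet_model.pth", map_location="cpu")
print(len(state), "parameter tensors; first keys:", list(state)[:3])

model = registry.build("resnet", input_length=500)
model.load_state_dict(state)  # raises if the architecture does not match
model.eval()
```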

Workflows & Interactions

  • CLI Training Pipeline: The main training workflow starts with scripts/train_model.py, which imports the model registry (models/registry.py) to dynamically select an architecture (Figure2CNN, ResNet1D, or ResNet18Vision). It uses scripts/preprocess_dataset.py to load and preprocess Raman spectra from datasets/rdwp/, applying resampling, baseline correction, smoothing, and normalization. The script performs 10-fold stratified cross-validation and saves trained models to outputs/{model}_model.pth, with diagnostics written to outputs/logs/ (a skeleton of this loop follows this list).

  • CLI Inference Pipeline: Running scripts/run_inference.py loads a trained model via the registry, processes a single Raman spectrum file through the same preprocessing pipeline, and writes predictions as JSON to outputs/inference/ (sketched after this list).

  • UI Workflow: The Streamlit application (deploy/hf-space/app.py) provides a web interface that loads trained models and accepts file uploads or sample-data selection, but it currently bypasses the full preprocessing pipeline (baseline correction, smoothing, and normalization) before running inference.

  • Validation Workflow: The validate_pipeline.sh script orchestrates a complete pipeline test by sequentially running preprocessing, training, inference, and plotting scripts to ensure reproducibility and catch regressions.

  • Model Registry System: All model architectures are centrally managed through models/registry.py, which provides dynamic model selection for both CLI training and inference scripts, ensuring consistent model instantiation across the codebase.
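
The training workflow above reduces to a familiar skeleton. A sketch assuming full-batch updates, Adam, and cross-entropy loss for brevity; the real script's batching, seeding, and checkpointing policies may differ:

```python
# Skeleton of 10-fold stratified CV training; optimizer details are assumptions.
import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold

from models import registry


def train_cv(X: np.ndarray, y: np.ndarray, name: str = "figure2", epochs: int = 10):
    """X: (n_samples, spectrum_length) preprocessed spectra; y: integer labels."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    loss_fn = torch.nn.CrossEntropyLoss()
    scores = []
    for fold, (tr, va) in enumerate(skf.split(X, y)):
        model = registry.build(name, input_length=X.shape[1])
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        xb = torch.tensor(X[tr], dtype=torch.float32).unsqueeze(1)  # (N, 1, L)
        yb = torch.tensor(y[tr], dtype=torch.long)
        for _ in range(epochs):  # full-batch updates, for brevity
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
        with torch.no_grad():
            xv = torch.tensor(X[va], dtype=torch.float32).unsqueeze(1)
            preds = model(xv).argmax(dim=1).numpy()
        scores.append((preds == y[va]).mean())
        print(f"fold {fold}: accuracy {scores[-1]:.3f}")
    torch.save(model.state_dict(), f"outputs/{name}_model.pth")  # saving policy assumed
    return float(np.mean(scores))
```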
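
The inference step similarly reduces to a load-preprocess-predict-serialize sequence; in this sketch the JSON schema and output naming are assumptions:

```python
# Single-spectrum inference with JSON output; schema and naming are assumed.
import json
from pathlib import Path

import numpy as np
import torch

from models import registry


def run_inference(spectrum_path: str, model_name: str = "resnet",
                  weights: str = "outputs/resnet_model.pth") -> dict:
    data = np.loadtxt(spectrum_path)
    # The real script applies the same preprocess_spectrum() chain here.
    intensities = data[:, 1].astype(np.float32)
    model = registry.build(model_name, input_length=len(intensities))
    model.load_state_dict(torch.load(weights, map_location="cpu"))
    model.eval()
    with torch.no_grad():
        logits = model(torch.from_numpy(intensities).view(1, 1, -1))
    result = {"file": spectrum_path, "model": model_name,
              "prediction": int(logits.argmax(dim=1))}
    out = Path("outputs/inference") / (Path(spectrum_path).stem + ".json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(result, indent=2))
    return result
```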

External Dependencies Summary

  • PyTorch Ecosystem: torch, torchvision for deep learning model implementation and training
  • Scientific Computing: numpy, scipy for numerical operations and signal processing
  • Machine Learning: scikit-learn for preprocessing, metrics, and cross-validation utilities
  • Data Handling: pandas for structured data manipulation
  • Visualization: matplotlib, seaborn for plotting and data visualization
  • Web Framework: streamlit for interactive web application deployment
  • Image Processing: PIL (Pillow) for image handling in the UI
  • Development Tools: argparse for CLI interfaces and json for data serialization (both Python standard library rather than external packages)
  • Deployment: fastapi, uvicorn for potential API deployment, huggingface-hub for model hosting

Key Findings & Assumptions

  • Critical Preprocessing Gap: The UI workflow in deploy/hf-space/app.py bypasses essential preprocessing steps (baseline correction, smoothing, normalization) that are standard in the CLI pipeline, potentially causing prediction inconsistencies.

  • Model Architecture Assumptions: Three CNN architectures are registered (figure2, resnet, resnet18vision) but the codebase suggests only two are currently trained and validated in the standard pipeline.

  • Dataset Structure: The system assumes Raman spectra are stored as two-column text files (wavenumber, intensity) in the datasets/rdwp/ directory, with filenames indicating weathering conditions for automated labeling.

  • Environment Fragmentation: The project uses different dependency-management systems (Conda for local development, venv for HPC, pip requirements for deployment), which could lead to environment inconsistencies.

  • Reproducibility Controls: Strong emphasis on scientific reproducibility with fixed random seeds, deterministic algorithms, and comprehensive validation scripts, indicating this is research-oriented code requiring strict experimental controls.

  • Deployment Readiness: The Hugging Face Spaces deployment setup suggests the project is intended for public demonstration or research sharing, but the preprocessing gap needs resolution for production use.

  • Legacy Code Management: The .gitignore and documentation references suggest active management of deprecated FTIR-related components, indicating focused scope refinement to Raman-only analysis.