Codebase Inventory: ml-polymer-recycling
Overview
A machine learning system for polymer aging prediction and classification using spectral data analysis. The project implements multiple CNN architectures (Figure2CNN, ResNet1D, ResNet18Vision) to classify polymer degradation levels as a proxy for recyclability. It is built with Python and PyTorch and provides both CLI and Streamlit UI workflows.
Inventory by Category
1. Core Application Modules
Module Name: `models/registry.py`
- Purpose: Central registry for model architectures, providing dynamic model selection and instantiation
- Key Exports/Functions: `choices()`, `build(name, input_length)`, `_REGISTRY`
- Key Dependencies: `models.figure2_cnn`, `models.resnet_cnn`, `models.resnet18_vision`
- External Dependencies: `typing`
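
The registry module's source is not reproduced in this inventory. Below is a minimal sketch of how such a name-to-builder registry is commonly wired, assuming the exports listed above; the registry keys mirror the CLI model names cited later in this document, and the `input_length` constructor keyword is an assumption:

```python
# Minimal registry sketch (assumed structure, not the verbatim module).
from typing import Callable, Dict, List

import torch.nn as nn

from models.figure2_cnn import Figure2CNN
from models.resnet_cnn import ResNet1D
from models.resnet18_vision import ResNet18Vision

# Keys mirror the CLI model names cited later in this inventory
# (figure2, resnet, resnet18vision); constructor signatures are assumed.
_REGISTRY: Dict[str, Callable[[int], nn.Module]] = {
    "figure2": lambda n: Figure2CNN(input_length=n),
    "resnet": lambda n: ResNet1D(input_length=n),
    "resnet18vision": lambda n: ResNet18Vision(input_length=n),
}

def choices() -> List[str]:
    """Model names accepted by the training/inference CLIs."""
    return sorted(_REGISTRY)

def build(name: str, input_length: int) -> nn.Module:
    """Instantiate a registered architecture for a given spectrum length."""
    try:
        return _REGISTRY[name](input_length)
    except KeyError as err:
        raise ValueError(f"Unknown model '{name}'; choose from {choices()}") from err
```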
Module Name: `models/figure2_cnn.py`
- Purpose: CNN architecture based on the literature (Neo et al., 2023) for 1D Raman spectral classification
- Key Exports/Functions: `Figure2CNN` class with convolutional blocks and classifier layers
- Key Dependencies: None (self-contained)
- External Dependencies: `torch`, `torch.nn`
Module Name: `models/resnet_cnn.py`
- Purpose: ResNet1D implementation with residual blocks for deeper spectral feature learning
- Key Exports/Functions: `ResNet1D`, `ResidualBlock1D` classes
- Key Dependencies: None (self-contained)
- External Dependencies: `torch`, `torch.nn`
Module Name: `models/resnet18_vision.py`
- Purpose: ResNet18 architecture adapted for 1D spectral data processing
- Key Exports/Functions: `ResNet18Vision` class
- Key Dependencies: None (self-contained)
- External Dependencies: `torch`, `torch.nn`
Module Name: `utils/preprocessing.py`
- Purpose: Spectral data preprocessing utilities, including resampling, baseline correction, smoothing, and normalization
- Key Exports/Functions: `preprocess_spectrum()`, `resample_spectrum()`, `remove_baseline()`, `normalize_spectrum()`, `smooth_spectrum()`
- Key Dependencies: None (self-contained)
- External Dependencies: `numpy`, `scipy.interpolate`, `scipy.signal`, `sklearn.preprocessing`
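
The module's internals are not shown in this inventory; the following sketch illustrates one plausible implementation of the exported chain using the listed dependencies. The target length, polynomial degree, and smoothing window are illustrative placeholders:

```python
# Plausible preprocessing chain (parameter values are assumptions).
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter

def resample_spectrum(x, y, target_len=500):
    """Interpolate intensities onto a uniform wavenumber grid."""
    grid = np.linspace(x.min(), x.max(), target_len)
    return grid, interp1d(x, y, kind="linear")(grid)

def remove_baseline(y, degree=2):
    """Subtract a low-order polynomial fit as a simple baseline estimate."""
    idx = np.arange(len(y))
    return y - np.polyval(np.polyfit(idx, y, degree), idx)

def smooth_spectrum(y, window=11, polyorder=3):
    """Savitzky-Golay filter to suppress high-frequency noise."""
    return savgol_filter(y, window, polyorder)

def normalize_spectrum(y):
    """Min-max scale intensities into [0, 1]."""
    span = y.max() - y.min()
    return (y - y.min()) / span if span else y

def preprocess_spectrum(x, y, target_len=500):
    """Full chain: resample -> baseline-correct -> smooth -> normalize."""
    grid, resampled = resample_spectrum(x, y, target_len)
    return grid, normalize_spectrum(smooth_spectrum(remove_baseline(resampled)))
```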
Module Name: `scripts/preprocess_dataset.py`
- Purpose: Dataset preprocessing pipeline with a CLI interface for Raman spectral data
- Key Exports/Functions: `preprocess_dataset()`, `resample_spectrum()`, `label_file()`, preprocessing helper functions
- Key Dependencies: `scripts.discover_raman_files`, `scripts.plot_spectrum`
- External Dependencies: `numpy`, `scipy`, `sklearn.preprocessing`
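
Filename-based labeling is central to this script, but the exact rule is not documented here, so the sketch below is hypothetical. The `sta-` prefix is taken from the sample files elsewhere in this inventory, while `wea-` and the class mapping are invented for illustration:

```python
# Hypothetical label_file() sketch; the real rule in
# scripts/preprocess_dataset.py may use different prefixes or classes.
import re
from pathlib import Path

def label_file(path: str) -> int:
    """Infer a degradation label from a weathering-condition filename."""
    stem = Path(path).stem.lower()
    if re.match(r"^sta", stem):   # e.g. sta-1.txt (prefix seen in sample_data)
        return 0                  # unweathered / stable
    if re.match(r"^wea", stem):   # illustrative weathered prefix
        return 1                  # weathered
    raise ValueError(f"Cannot infer label from filename: {path}")
```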
2. Scripts & Automation
Script Name: `validate_pipeline.sh`
- Trigger: Manual execution (`./validate_pipeline.sh`)
- Apparent Function: Canonical smoke test validating the complete Raman pipeline from preprocessing through training to inference
- Dependencies: `conda`, `scripts/preprocess_dataset.py`, `scripts/train_model.py`, `scripts/run_inference.py`, `scripts/plot_spectrum.py`
Script Name: `scripts/train_model.py`
- Trigger: CLI execution (`python scripts/train_model.py`)
- Apparent Function: 10-fold stratified cross-validation training with multiple model architectures and preprocessing options
- Dependencies: `scripts/preprocess_dataset`, `models/registry`, reproducibility seeds, PyTorch training loop
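
The cross-validation loop itself is not reproduced in this inventory; here is a condensed sketch of the described behavior, assuming the registry API sketched earlier. Seeds, epochs, optimizer settings, and the full-batch handling are placeholders:

```python
# Condensed 10-fold stratified CV sketch (hyperparameters are assumptions).
import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold

from models.registry import build

def train_cv(X: np.ndarray, y: np.ndarray, model_name: str, epochs: int = 10) -> float:
    torch.manual_seed(0)   # reproducibility seeds, as the script emphasizes
    np.random.seed(0)
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    fold_acc = []
    for train_idx, val_idx in skf.split(X, y):
        model = build(model_name, input_length=X.shape[1])
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = torch.nn.CrossEntropyLoss()
        xb = torch.tensor(X[train_idx], dtype=torch.float32).unsqueeze(1)  # (N, 1, L)
        yb = torch.tensor(y[train_idx], dtype=torch.long)
        for _ in range(epochs):          # full-batch loop for brevity
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
        with torch.no_grad():
            xv = torch.tensor(X[val_idx], dtype=torch.float32).unsqueeze(1)
            preds = model(xv).argmax(dim=1).numpy()
        fold_acc.append(float((preds == y[val_idx]).mean()))
    return float(np.mean(fold_acc))
```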
Script Name: `scripts/run_inference.py`
- Trigger: CLI execution (`python scripts/run_inference.py`)
- Apparent Function: Single-spectrum inference with model loading, preprocessing, and prediction output to JSON
- Dependencies: `models/registry`, `scripts/preprocess_dataset`, trained model weights
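
A sketch of the described single-spectrum flow follows; the weights path, model name, and JSON schema are assumptions rather than confirmed details of `run_inference.py`:

```python
# Single-spectrum inference sketch (output schema is an assumption).
import json

import numpy as np
import torch

from models.registry import build
from utils.preprocessing import preprocess_spectrum

def run_inference(spectrum_path: str, weights: str = "outputs/resnet_model.pth") -> dict:
    x, y = np.loadtxt(spectrum_path, unpack=True)      # two-column text file
    _, y = preprocess_spectrum(x, y, target_len=500)   # same chain as training
    model = build("resnet", input_length=len(y))
    model.load_state_dict(torch.load(weights, map_location="cpu"))
    model.eval()
    with torch.no_grad():
        logits = model(torch.tensor(y, dtype=torch.float32).view(1, 1, -1))
    result = {
        "input": spectrum_path,
        "prediction": int(logits.argmax(dim=1)),
        "logits": logits.squeeze(0).tolist(),
    }
    print(json.dumps(result, indent=2))
    return result
```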
Script Name: `scripts/plot_spectrum.py`
- Trigger: CLI execution (`python scripts/plot_spectrum.py`)
- Apparent Function: Visualization tool for Raman spectra with matplotlib plotting and file I/O
- Dependencies: Spectrum loading utilities
Script Name: `scripts/discover_raman_files.py`
- Trigger: Imported by other scripts
- Apparent Function: File discovery and labeling utilities for Raman dataset management
- Dependencies: File system operations, regex pattern matching
Script Name: `scripts/list_spectra.py`
- Trigger: CLI or import
- Apparent Function: Dataset inventory and spectrum-listing utilities
- Dependencies: File system scanning
3. Configuration & Data
File Name: `deploy/hf-space/requirements.txt`
- Purpose: Python dependencies for Hugging Face Spaces deployment
- Key Contents/Structure: `streamlit`, `torch`, `torchvision`, `scikit-learn`, `scipy`, `numpy`, `pandas`, `matplotlib`, `fastapi`, `altair`, `huggingface-hub`
File Name: `deploy/hf-space/Dockerfile`
- Purpose: Container configuration for Hugging Face Spaces deployment
- Key Contents/Structure: Python 3.13-slim base image, build-tools installation, Streamlit server configured on port 8501
File Name: `deploy/hf-space/sample_data/sta-1.txt`
- Purpose: Sample Raman spectrum for UI demonstration
- Key Contents/Structure: Two-column wavenumber/intensity data format
File Name: `deploy/hf-space/sample_data/sta-2.txt`
- Purpose: Additional sample Raman spectrum for UI testing
- Key Contents/Structure: Two-column wavenumber/intensity data format
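
Assuming whitespace-delimited columns, these sample files can be read with a one-liner:

```python
import numpy as np

# Two columns -> two equal-length 1D arrays (wavenumber, intensity).
wavenumbers, intensities = np.loadtxt(
    "deploy/hf-space/sample_data/sta-1.txt", unpack=True
)
```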
File Name: `.gitignore`
- Purpose: Version-control exclusions for datasets, build artifacts, and system files
- Key Contents/Structure: `datasets/`, `__pycache__/`, model weights, logs, environment files, deprecated scripts
File Name: `MANIFEST.git`
- Purpose: Git object manifest listing all tracked files with hashes
- Key Contents/Structure: File paths, permissions, and SHA hashes for repository contents
4. Assets & Documentation
Asset Name: `README.md`
- Purpose: Primary project documentation with objectives, architecture overview, and usage instructions
- Key Contents/Structure: Project goals, model architecture table, structure diagram, installation guides, sample commands
Asset Name: `GROUND_TRUTH_PIPELINE.md`
- Purpose: Comprehensive empirical baseline inventory documenting every aspect of the current system
- Key Contents/Structure: 635-line detailed documentation of data handling, preprocessing, models, CLI workflow, UI workflow, and gap identification
Asset Name: `docs/ENVIRONMENT_GUIDE.md`
- Purpose: Environment-management guide for local and HPC deployment
- Key Contents/Structure: Conda vs. venv setup instructions, platform-specific configurations, dependency management
Asset Name: `docs/PROJECT_TIMELINE.md`
- Purpose: Development milestone tracking and project progression documentation
- Key Contents/Structure: Phase-based timeline from project kickoff through model expansion, tagged milestones
Asset Name: `docs/sprint_log.md`
- Purpose: Sprint-based development log with specific technical changes and testing results
- Key Contents/Structure: Chronological entries with goals, changes, tests, and notes for each development sprint
Asset Name: `docs/REPRODUCIBILITY.md`
- Purpose: Scientific reproducibility guidelines and artifact-control documentation
- Key Contents/Structure: Validation procedures, artifact integrity, experimental controls
Asset Name: `docs/HPC_REMOTE_SETUP.md`
- Purpose: High-performance computing environment setup for the CWRU Pioneer cluster
- Key Contents/Structure: HPC-specific configurations, remote-access procedures, computational resource management
Asset Name: `docs/BACKEND_MIGRATION_LOG.md`
- Purpose: Technical migration documentation for backend architecture changes
- Key Contents/Structure: Migration procedures, compatibility notes, system architecture evolution
5. Deployment & UI Components
Module Name: `deploy/hf-space/app.py`
- Purpose: Streamlit web application for polymer classification with file upload and model inference
- Key Exports/Functions: Streamlit UI components, model loading, preprocessing pipeline, prediction display
- Key Dependencies: `models.figure2_cnn`, `models.resnet_cnn`, `utils.preprocessing` (fallback), `scripts.preprocess_dataset`
- External Dependencies: `streamlit`, `torch`, `matplotlib`, `PIL`, `numpy`
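
A hedged sketch of the app's upload-and-predict flow; widget labels, the fixed input length, and the constructor signature are illustrative rather than verbatim from `app.py`:

```python
# Streamlit flow sketch (labels and parameters are assumptions).
import numpy as np
import streamlit as st
import torch

from models.resnet_cnn import ResNet1D  # direct import, per the dependency list

st.title("Polymer Aging Classification")
uploaded = st.file_uploader("Upload a Raman spectrum (two-column .txt)")
if uploaded is not None:
    x, y = np.loadtxt(uploaded, unpack=True)
    target_len = 500                                  # assumed model input length
    grid = np.linspace(x.min(), x.max(), target_len)
    y = np.interp(grid, x, y)                         # resample only
    # NOTE: per the findings below, the deployed app currently skips baseline
    # correction, smoothing, and normalization at this point.
    model = ResNet1D(input_length=target_len)         # signature assumed
    model.load_state_dict(torch.load("outputs/resnet_model.pth", map_location="cpu"))
    model.eval()
    with torch.no_grad():
        logits = model(torch.tensor(y, dtype=torch.float32).view(1, 1, -1))
    st.write(f"Predicted class: {int(logits.argmax(dim=1))}")
```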
6. Model Artifacts & Outputs
File Name: `outputs/resnet_model.pth`
- Purpose: Trained ResNet1D model weights for Raman spectrum classification
- Key Contents/Structure: PyTorch state dictionary with model parameters
Workflows & Interactions
CLI Training Pipeline: The main training workflow starts with `scripts/train_model.py`, which imports the model registry (`models/registry.py`) to dynamically select architectures (Figure2CNN, ResNet1D, or ResNet18Vision). It uses `scripts/preprocess_dataset.py` to load and preprocess Raman spectra from `datasets/rdwp/`, applying resampling, baseline correction, smoothing, and normalization. The script performs 10-fold stratified cross-validation and saves trained models to `outputs/{model}_model.pth`, with diagnostics written to `outputs/logs/`.

CLI Inference Pipeline: Running `scripts/run_inference.py` loads a trained model via the registry, processes a single Raman spectrum file through the same preprocessing pipeline, and writes predictions as JSON to `outputs/inference/`.

UI Workflow: The Streamlit application (`deploy/hf-space/app.py`) provides a web interface that loads trained models and accepts file uploads or sample data selection, but it currently bypasses the full preprocessing pipeline (missing baseline correction, smoothing, and normalization steps) before running inference.

Validation Workflow: The `validate_pipeline.sh` script orchestrates a complete pipeline test by sequentially running the preprocessing, training, inference, and plotting scripts to ensure reproducibility and catch regressions.

Model Registry System: All model architectures are centrally managed through `models/registry.py`, which provides dynamic model selection for both the CLI training and inference scripts, ensuring consistent model instantiation across the codebase (see the usage sketch below).
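
As a concrete illustration of that shared instantiation path (names taken from the registry entry above; the `input_length` value is a placeholder), both CLI scripts can resolve a model with the same two calls:

```python
from models.registry import build, choices

print(choices())                             # e.g. ['figure2', 'resnet', 'resnet18vision']
model = build("figure2", input_length=500)   # identical call in training and inference
```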
External Dependencies Summary
- PyTorch Ecosystem: `torch`, `torchvision` for deep learning model implementation and training
- Scientific Computing: `numpy`, `scipy` for numerical operations and signal processing
- Machine Learning: `scikit-learn` for preprocessing, metrics, and cross-validation utilities
- Data Handling: `pandas` for structured data manipulation
- Visualization: `matplotlib`, `seaborn` for plotting and data visualization
- Web Framework: `streamlit` for interactive web application deployment
- Image Processing: `PIL` (Pillow) for image handling in the UI
- Development Tools: `argparse` for CLI interfaces, `json` for data serialization
- Deployment: `fastapi`, `uvicorn` for potential API deployment, `huggingface-hub` for model hosting
Key Findings & Assumptions
Critical Preprocessing Gap: The UI workflow in `deploy/hf-space/app.py` bypasses essential preprocessing steps (baseline correction, smoothing, normalization) that are standard in the CLI pipeline, potentially causing prediction inconsistencies.

Model Architecture Assumptions: Three CNN architectures are registered (`figure2`, `resnet`, `resnet18vision`), but the codebase suggests only two are currently trained and validated in the standard pipeline.

Dataset Structure: The system assumes Raman spectra are stored as two-column text files (wavenumber, intensity) in the `datasets/rdwp/` directory, with filenames indicating weathering conditions for automated labeling.

Environment Fragmentation: The project uses different dependency-management systems (Conda for local development, venv for HPC, pip requirements for deployment), which could lead to environment inconsistencies.
Reproducibility Controls: Strong emphasis on scientific reproducibility with fixed random seeds, deterministic algorithms, and comprehensive validation scripts, indicating this is research-oriented code requiring strict experimental controls.
Deployment Readiness: The Hugging Face Spaces deployment setup suggests the project is intended for public demonstration or research sharing, but the preprocessing gap needs resolution for production use.
Legacy Code Management: The `.gitignore` and documentation references suggest active management of deprecated FTIR-related components, indicating a focused scope refinement to Raman-only analysis.