# Codebase Inventory: ml-polymer-recycling

## Overview

A machine learning system for predicting and classifying polymer aging from spectral data. The project implements multiple CNN architectures (Figure2CNN, ResNet1D, ResNet18Vision) to classify polymer degradation levels as a proxy for recyclability. It is built with Python and PyTorch and provides both CLI and Streamlit UI workflows.

## Inventory by Category

### 1. Core Application Modules

- **Module Name**: `models/registry.py`
  - **Purpose**: Central registry system for model architectures providing dynamic model selection and instantiation
  - **Key Exports/Functions**: `choices()`, `build(name, input_length)`, `_REGISTRY`
  - **Key Dependencies**: `models.figure2_cnn`, `models.resnet_cnn`, `models.resnet18_vision`
  - **External Dependencies**: `typing`
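
A minimal sketch of how such a registry might look, assuming each architecture's constructor accepts an `input_length` keyword; the registry keys follow the names noted under Key Findings:

```python
from typing import Callable, Dict, List

from models.figure2_cnn import Figure2CNN
from models.resnet_cnn import ResNet1D
from models.resnet18_vision import ResNet18Vision

_REGISTRY: Dict[str, Callable] = {
    "figure2": Figure2CNN,
    "resnet": ResNet1D,
    "resnet18vision": ResNet18Vision,
}

def choices() -> List[str]:
    """Names accepted by build()."""
    return sorted(_REGISTRY)

def build(name: str, input_length: int):
    """Instantiate a registered architecture by name."""
    if name not in _REGISTRY:
        raise ValueError(f"Unknown model '{name}'; choose from {choices()}")
    return _REGISTRY[name](input_length=input_length)
```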

- **Module Name**: `models/figure2_cnn.py`
  - **Purpose**: CNN architecture implementation based on literature (Neo et al. 2023) for 1D Raman spectral classification
  - **Key Exports/Functions**: `Figure2CNN` class with conv blocks and classifier layers
  - **Key Dependencies**: None (self-contained)
  - **External Dependencies**: `torch`, `torch.nn`
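
The architecture itself follows Neo et al. 2023 and is not reproduced here; the sketch below only illustrates the conv-blocks-plus-classifier shape, with made-up channel counts and kernel sizes:

```python
import torch
import torch.nn as nn

class Figure2CNN(nn.Module):
    """Conv blocks followed by a dense classifier (illustrative sizes)."""

    def __init__(self, input_length: int, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(2),
        )
        # Two stride-2 pools shrink the length by a factor of 4.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * (input_length // 4), 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, input_length)
        return self.classifier(self.features(x))
```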

- **Module Name**: `models/resnet_cnn.py`
  - **Purpose**: ResNet1D implementation with residual blocks for deeper spectral feature learning
  - **Key Exports/Functions**: `ResNet1D`, `ResidualBlock1D` classes
  - **Key Dependencies**: None (self-contained)
  - **External Dependencies**: `torch`, `torch.nn`
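
A hedged sketch of what `ResidualBlock1D` could look like; channel counts and kernel size are illustrative, not taken from the module:

```python
import torch
import torch.nn as nn

class ResidualBlock1D(nn.Module):
    """Two 1D convolutions with a skip connection (illustrative sizes)."""

    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)  # residual addition
```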

- **Module Name**: `models/resnet18_vision.py`
  - **Purpose**: ResNet18 architecture adapted for 1D spectral data processing
  - **Key Exports/Functions**: `ResNet18Vision` class
  - **Key Dependencies**: None (self-contained)
  - **External Dependencies**: `torch`, `torch.nn`

- **Module Name**: `utils/preprocessing.py`
  - **Purpose**: Spectral data preprocessing utilities including resampling, baseline correction, smoothing, and normalization
  - **Key Exports/Functions**: `preprocess_spectrum()`, `resample_spectrum()`, `remove_baseline()`, `normalize_spectrum()`, `smooth_spectrum()`
  - **Key Dependencies**: None (self-contained)
  - **External Dependencies**: `numpy`, `scipy.interpolate`, `scipy.signal`, `sklearn.preprocessing`
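
A compressed sketch of the preprocessing chain, assuming a linear-interpolation resampler, a low-order polynomial baseline, and Savitzky-Golay smoothing; the module's actual methods and parameter defaults may differ:

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter

def resample_spectrum(x, y, target_len=500):
    """Interpolate intensities onto a uniform wavenumber grid."""
    order = np.argsort(x)              # ensure ascending wavenumbers
    x, y = x[order], y[order]
    grid = np.linspace(x[0], x[-1], target_len)
    return grid, interp1d(x, y)(grid)

def remove_baseline(y, degree=2):
    """Subtract a low-order polynomial fit as a simple baseline stand-in."""
    idx = np.arange(len(y))
    return y - np.polyval(np.polyfit(idx, y, degree), idx)

def smooth_spectrum(y, window=11, polyorder=2):
    """Savitzky-Golay smoothing."""
    return savgol_filter(y, window, polyorder)

def normalize_spectrum(y):
    """Min-max scale to [0, 1]."""
    return (y - y.min()) / (y.max() - y.min() + 1e-12)

def preprocess_spectrum(x, y, target_len=500):
    """Resample, baseline-correct, smooth, and normalize, in that order."""
    grid, y = resample_spectrum(x, y, target_len)
    return grid, normalize_spectrum(smooth_spectrum(remove_baseline(y)))
```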

- **Module Name**: `scripts/preprocess_dataset.py`
  - **Purpose**: Comprehensive dataset preprocessing pipeline with CLI interface for Raman spectral data
  - **Key Exports/Functions**: `preprocess_dataset()`, `resample_spectrum()`, `label_file()`, preprocessing helper functions
  - **Key Dependencies**: `scripts.discover_raman_files`, `scripts.plot_spectrum`
  - **External Dependencies**: `numpy`, `scipy`, `sklearn.preprocessing`
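
Labeling is driven by filename conventions (see Key Findings). A hypothetical `label_file()`, under the assumption that `sta`/`wea` substrings mark stable versus weathered samples; the real naming scheme may differ:

```python
def label_file(filename: str) -> int:
    """Infer a class label from the filename (hypothetical convention)."""
    name = filename.lower()
    if "wea" in name:
        return 1  # weathered / aged
    if "sta" in name:
        return 0  # stable / unweathered
    raise ValueError(f"Cannot infer label from '{filename}'")
```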

### 2. Scripts & Automation

- **Script Name**: `validate_pipeline.sh`
  - **Trigger**: Manual execution (`./validate_pipeline.sh`)
  - **Apparent Function**: Canonical smoke test validating the complete Raman pipeline from preprocessing through training to inference
  - **Dependencies**: `conda`, `scripts/preprocess_dataset.py`, `scripts/train_model.py`, `scripts/run_inference.py`, `scripts/plot_spectrum.py`

- **Script Name**: `scripts/train_model.py`
  - **Trigger**: CLI execution (`python scripts/train_model.py`)
  - **Apparent Function**: 10-fold stratified cross-validation training with multiple model architectures and preprocessing options
  - **Dependencies**: `scripts/preprocess_dataset`, `models/registry`, reproducibility seeds, PyTorch training loop
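
A compressed sketch of the cross-validation loop, assuming full-batch Adam training and cross-entropy loss; the real script's optimizer, batching, epoch count, and seed value are assumptions here:

```python
import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold

from models.registry import build

def cross_validate(X: np.ndarray, y: np.ndarray, model_name: str, epochs: int = 10):
    """10-fold stratified CV; returns mean validation accuracy."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = build(model_name, input_length=X.shape[1])
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = torch.nn.CrossEntropyLoss()
        xb = torch.tensor(X[train_idx], dtype=torch.float32).unsqueeze(1)
        yb = torch.tensor(y[train_idx], dtype=torch.long)
        model.train()
        for _ in range(epochs):                 # full-batch updates
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            xv = torch.tensor(X[val_idx], dtype=torch.float32).unsqueeze(1)
            preds = model(xv).argmax(dim=1).numpy()
        scores.append(float((preds == y[val_idx]).mean()))
    return float(np.mean(scores))
```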

- **Script Name**: `scripts/run_inference.py`
  - **Trigger**: CLI execution (`python scripts/run_inference.py`)
  - **Apparent Function**: Single spectrum inference with model loading, preprocessing, and prediction output to JSON
  - **Dependencies**: `models/registry`, `scripts/preprocess_dataset`, trained model weights
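
A minimal sketch of the inference flow under the same assumptions as above; the JSON field names and output filename are illustrative:

```python
import json
import os

import numpy as np
import torch

from models.registry import build
from utils.preprocessing import preprocess_spectrum

def run_inference(spectrum_path, weights_path, model_name="resnet", target_len=500):
    x, y = np.loadtxt(spectrum_path, unpack=True)     # two-column text file
    _, y = preprocess_spectrum(x, y, target_len)
    model = build(model_name, input_length=target_len)
    model.load_state_dict(torch.load(weights_path, map_location="cpu"))
    model.eval()
    with torch.no_grad():
        logits = model(torch.tensor(y, dtype=torch.float32).view(1, 1, -1))
    result = {"file": str(spectrum_path), "prediction": int(logits.argmax())}
    os.makedirs("outputs/inference", exist_ok=True)
    with open("outputs/inference/result.json", "w") as f:
        json.dump(result, f, indent=2)
    return result
```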

- **Script Name**: `scripts/plot_spectrum.py`
  - **Trigger**: CLI execution (`python scripts/plot_spectrum.py`)
  - **Apparent Function**: Visualization tool for Raman spectra with matplotlib plotting and file I/O
  - **Dependencies**: Spectrum loading utilities
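
A bare-bones version of such a plotting utility, assuming two-column input files; the axis labels and figure settings are illustrative:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_spectrum(path, out_png="spectrum.png"):
    """Plot a two-column (wavenumber, intensity) spectrum to a PNG."""
    wavenumber, intensity = np.loadtxt(path, unpack=True)
    plt.figure(figsize=(8, 4))
    plt.plot(wavenumber, intensity, linewidth=0.8)
    plt.xlabel("Wavenumber (cm$^{-1}$)")
    plt.ylabel("Intensity (a.u.)")
    plt.tight_layout()
    plt.savefig(out_png, dpi=150)
    plt.close()
```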

- **Script Name**: `scripts/discover_raman_files.py`
  - **Trigger**: Imported by other scripts
  - **Apparent Function**: File discovery and labeling utilities for Raman dataset management
  - **Dependencies**: File system operations, regex pattern matching

- **Script Name**: `scripts/list_spectra.py`
  - **Trigger**: CLI or import
  - **Apparent Function**: Dataset inventory and spectrum listing utilities
  - **Dependencies**: File system scanning

### 3. Configuration & Data

- **File Name**: `deploy/hf-space/requirements.txt`
  - **Purpose**: Python dependencies for Hugging Face Spaces deployment
  - **Key Contents/Structure**: `streamlit`, `torch`, `torchvision`, `scikit-learn`, `scipy`, `numpy`, `pandas`, `matplotlib`, `fastapi`, `altair`, `huggingface-hub`

- **File Name**: `deploy/hf-space/Dockerfile`
  - **Purpose**: Container configuration for Hugging Face Spaces deployment
  - **Key Contents/Structure**: Python 3.13-slim base, build tools installation, Streamlit server configuration on port 8501

- **File Name**: `deploy/hf-space/sample_data/sta-1.txt`
  - **Purpose**: Sample Raman spectrum for UI demonstration
  - **Key Contents/Structure**: Two-column wavenumber/intensity data format

- **File Name**: `deploy/hf-space/sample_data/sta-2.txt`
  - **Purpose**: Additional sample Raman spectrum for UI testing
  - **Key Contents/Structure**: Two-column wavenumber/intensity data format

- **File Name**: `.gitignore`
  - **Purpose**: Version control exclusions for datasets, build artifacts, and system files
  - **Key Contents/Structure**: `datasets/`, `__pycache__/`, model weights, logs, environment files, deprecated scripts

- **File Name**: `MANIFEST.git`
  - **Purpose**: Git object manifest listing all tracked files with hashes
  - **Key Contents/Structure**: File paths, permissions, and SHA hashes for repository contents

### 4. Assets & Documentation

- **Asset Name**: `README.md`
  - **Purpose**: Primary project documentation with objectives, architecture overview, and usage instructions
  - **Key Contents/Structure**: Project goals, model architectures table, structure diagram, installation guides, sample commands

- **Asset Name**: `GROUND_TRUTH_PIPELINE.md`
  - **Purpose**: Comprehensive empirical baseline inventory documenting every aspect of the current system
  - **Key Contents/Structure**: 635-line detailed documentation of data handling, preprocessing, models, CLI workflow, UI workflow, and gap identification

- **Asset Name**: `docs/ENVIRONMENT_GUIDE.md`
  - **Purpose**: Environment management guide for local and HPC deployment
  - **Key Contents/Structure**: Conda vs venv setup instructions, platform-specific configurations, dependency management

- **Asset Name**: `docs/PROJECT_TIMELINE.md`
  - **Purpose**: Development milestone tracking and project progression documentation
  - **Key Contents/Structure**: Phase-based timeline from project kickoff through model expansion, tagged milestones

- **Asset Name**: `docs/sprint_log.md`
  - **Purpose**: Sprint-based development log with specific technical changes and testing results
  - **Key Contents/Structure**: Chronological entries with goals, changes, tests, and notes for each development sprint

- **Asset Name**: `docs/REPRODUCIBILITY.md`
  - **Purpose**: Scientific reproducibility guidelines and artifact control documentation
  - **Key Contents/Structure**: Validation procedures, artifact integrity, experimental controls

- **Asset Name**: `docs/HPC_REMOTE_SETUP.md`
  - **Purpose**: High-performance computing environment setup for CWRU Pioneer cluster
  - **Key Contents/Structure**: HPC-specific configurations, remote access procedures, computational resource management

- **Asset Name**: `docs/BACKEND_MIGRATION_LOG.md`
  - **Purpose**: Technical migration documentation for backend architecture changes
  - **Key Contents/Structure**: Migration procedures, compatibility notes, system architecture evolution

### 5. Deployment & UI Components

- **Module Name**: `deploy/hf-space/app.py`
  - **Purpose**: Streamlit web application for polymer classification with file upload and model inference
  - **Key Exports/Functions**: Streamlit UI components, model loading, preprocessing pipeline, prediction display
  - **Key Dependencies**: `models.figure2_cnn`, `models.resnet_cnn`, `utils.preprocessing` (fallback), `scripts.preprocess_dataset`
  - **External Dependencies**: `streamlit`, `torch`, `matplotlib`, `PIL`, `numpy`
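
A skeleton of the upload-and-predict flow. Note that, unlike the current app, this sketch routes the upload through the full preprocessing chain (the gap flagged under Key Findings); the widget labels, chosen model, and target length are assumptions:

```python
import numpy as np
import streamlit as st
import torch

from models.registry import build
from utils.preprocessing import preprocess_spectrum

st.title("Polymer Degradation Classifier")
uploaded = st.file_uploader("Upload a two-column Raman spectrum (.txt)")
if uploaded is not None:
    x, y = np.loadtxt(uploaded, unpack=True)
    _, y = preprocess_spectrum(x, y, target_len=500)  # full CLI-style chain
    model = build("resnet", input_length=500)
    model.load_state_dict(torch.load("outputs/resnet_model.pth", map_location="cpu"))
    model.eval()
    with torch.no_grad():
        pred = model(torch.tensor(y, dtype=torch.float32).view(1, 1, -1)).argmax()
    st.write(f"Predicted class: {int(pred)}")
```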

### 6. Model Artifacts & Outputs

- **File Name**: `outputs/resnet_model.pth`
  - **Purpose**: Trained ResNet1D model weights for Raman spectrum classification
  - **Key Contents/Structure**: PyTorch state dictionary with model parameters

## Workflows & Interactions

- **CLI Training Pipeline**: The main training workflow starts with `scripts/train_model.py` which imports the model registry (`models/registry.py`) to dynamically select architectures (Figure2CNN, ResNet1D, or ResNet18Vision). It uses `scripts/preprocess_dataset.py` to load and preprocess Raman spectra from `datasets/rdwp/`, applying resampling, baseline correction, smoothing, and normalization. The script performs 10-fold stratified cross-validation and saves trained models to `outputs/{model}_model.pth` with diagnostics to `outputs/logs/`.

- **CLI Inference Pipeline**: Running `scripts/run_inference.py` loads a trained model via the registry, processes a single Raman spectrum file through the same preprocessing pipeline, and outputs predictions in JSON format to `outputs/inference/`.

- **UI Workflow**: The Streamlit application (`deploy/hf-space/app.py`) provides a web interface that loads trained models and accepts file uploads or sample-data selection, but it currently bypasses much of the preprocessing pipeline (no baseline correction, smoothing, or normalization) before running inference.

- **Validation Workflow**: The `validate_pipeline.sh` script orchestrates a complete pipeline test by sequentially running preprocessing, training, inference, and plotting scripts to ensure reproducibility and catch regressions.

- **Model Registry System**: All model architectures are centrally managed through `models/registry.py`, which provides dynamic model selection for both CLI training and inference scripts, ensuring consistent model instantiation across the codebase.

## External Dependencies Summary

- **PyTorch Ecosystem**: `torch`, `torchvision` for deep learning model implementation and training
- **Scientific Computing**: `numpy`, `scipy` for numerical operations and signal processing
- **Machine Learning**: `scikit-learn` for preprocessing, metrics, and cross-validation utilities
- **Data Handling**: `pandas` for structured data manipulation
- **Visualization**: `matplotlib`, `seaborn` for plotting and data visualization
- **Web Framework**: `streamlit` for interactive web application deployment
- **Image Processing**: `PIL` (Pillow) for image handling in the UI
- **Standard Library (not external)**: `argparse` for CLI interfaces, `json` for data serialization
- **Deployment**: `fastapi`, `uvicorn` for potential API deployment, `huggingface-hub` for model hosting

## Key Findings & Assumptions

- **Critical Preprocessing Gap**: The UI workflow in `deploy/hf-space/app.py` bypasses essential preprocessing steps (baseline correction, smoothing, normalization) that are standard in the CLI pipeline, potentially causing prediction inconsistencies.

- **Model Architecture Assumptions**: Three CNN architectures are registered (`figure2`, `resnet`, `resnet18vision`) but the codebase suggests only two are currently trained and validated in the standard pipeline.

- **Dataset Structure**: The system assumes Raman spectra are stored as two-column text files (wavenumber, intensity) in the `datasets/rdwp/` directory, with filenames indicating weathering conditions for automated labeling.

- **Environment Fragmentation**: The project uses different dependency management systems (Conda for local development, venv for HPC, pip requirements for deployment) which could lead to environment inconsistencies.

- **Reproducibility Controls**: Strong emphasis on scientific reproducibility with fixed random seeds, deterministic algorithms, and comprehensive validation scripts, indicating this is research-oriented code requiring strict experimental controls.

- **Deployment Readiness**: The Hugging Face Spaces deployment setup suggests the project is intended for public demonstration or research sharing, but the preprocessing gap needs resolution for production use.

- **Legacy Code Management**: The `.gitignore` and documentation references suggest active management of deprecated FTIR-related components, indicating focused scope refinement to Raman-only analysis.