Upload 14 files
- .gitattributes +3 -0
- README.md +119 -13
- US_engineered_features.csv +0 -0
- covid_country_ts.csv +3 -0
- covid_full_dataset.csv +3 -0
- gradio_app.py +363 -0
- hf_space.yml +3 -0
- preprocess_data.py +392 -0
- raw_confirmed.csv +0 -0
- raw_deaths.csv +0 -0
- raw_owid.csv +3 -0
- raw_recovered.csv +0 -0
- requirements.txt +10 -0
- run_pipeline.py +93 -0
- train_models.py +165 -0
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+covid_country_ts.csv filter=lfs diff=lfs merge=lfs -text
+covid_full_dataset.csv filter=lfs diff=lfs merge=lfs -text
+raw_owid.csv filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -1,13 +1,119 @@
# COVID-19 Prediction Model

This project implements a COVID-19 prediction system built on regression models, with Random Forest as the primary model alongside three other regressors for comparison. The system includes a Gradio user interface for deployment to Hugging Face Spaces.

## Features

- **Memory-optimized** data processing that can handle multiple datasets with heterogeneous column types
- **Multiple regression models** for comparison:
  - Random Forest Regression
  - Linear Regression
  - Support Vector Regression (SVR)
  - Gradient Boosting Regression
- **Gradio UI** for easy model selection, visualization, and deployment to Hugging Face Spaces
- Complete **data preprocessing pipeline** with feature engineering
- **Performance evaluation** metrics and visualization

## Project Structure

```
COVID-19-Prediction/
├── covid_full_dataset.csv       # Complete COVID-19 dataset
├── US_engineered_features.csv   # Engineered features for US data
├── raw_confirmed.csv            # Raw confirmed cases data
├── raw_deaths.csv               # Raw deaths data
├── raw_recovered.csv            # Raw recovered cases data
├── raw_owid.csv                 # Additional data from Our World in Data
├── covid_country_ts.csv         # Country-level time series data
├── preprocess_data.py           # Data preprocessing script
├── train_models.py              # Model training script
├── gradio_app.py                # Gradio UI for predictions
├── run_pipeline.py              # Complete pipeline runner
└── requirements.txt             # Project dependencies
```

## Installation

1. Clone this repository:
```
git clone https://github.com/yourusername/covid19-prediction.git
cd covid19-prediction
```

2. Install the required packages:
```
pip install -r requirements.txt
```

## Usage

### Run the Complete Pipeline

To run the complete pipeline (preprocessing, training, and UI):

```
python run_pipeline.py
```

### Pipeline Options

- Skip preprocessing: `python run_pipeline.py --skip-preprocessing`
- Skip training: `python run_pipeline.py --skip-training`
- Only launch UI: `python run_pipeline.py --only-ui`

### Run Individual Steps

1. **Data Preprocessing**:
```
python preprocess_data.py
```

2. **Model Training**:
```
python train_models.py
```

3. **Launch Gradio UI**:
```
python gradio_app.py
```

## Memory Optimization

This project is optimized to handle large datasets efficiently (a minimal sketch of the approach follows the list):

- Uses appropriate data types to minimize memory footprint
- Processes data in chunks for large files
- Employs garbage collection to free memory
- Uses compressed NumPy formats for storing processed data
- Optimizes model parameters for memory efficiency
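
The first two points are illustrated by the minimal, self-contained sketch below, in the spirit of what `preprocess_data.py` does (the function name and chunk size here are illustrative, not part of the pipeline):

```python
import numpy as np
import pandas as pd

def read_csv_memory_efficient(path, chunksize=100_000):
    """Read a large CSV in chunks, downcasting numeric columns along the way."""
    chunks = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # Downcast integers to the smallest type that fits the values
        for col in chunk.select_dtypes(include=['int']).columns:
            chunk[col] = pd.to_numeric(chunk[col], downcast='integer')
        # float32 is usually enough precision for case counts and rates
        for col in chunk.select_dtypes(include=['float']).columns:
            chunk[col] = chunk[col].astype(np.float32)
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)
```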

## Models

The project implements and compares four regression models (instantiated as shown in the sketch after this list):

1. **Random Forest Regressor**: An ensemble learning method that builds multiple decision trees and merges their predictions.
2. **Linear Regression**: A simple baseline model that assumes a linear relationship between features and target.
3. **Support Vector Regression (SVR)**: Uses support vectors to create a regression model that can capture non-linear relationships.
4. **Gradient Boosting Regressor**: An ensemble technique that builds trees sequentially, with each tree correcting errors made by previous ones.
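
For reference, the four models are instantiated in `train_models.py` roughly as follows (the hyperparameters shown are the ones used there):

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

models = {
    'Linear Regression': LinearRegression(),
    'Support Vector Regression': SVR(kernel='rbf', gamma='scale', C=1.0, epsilon=0.1),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                                   max_depth=3, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
}
```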

## Hugging Face Deployment

The Gradio interface is configured for easy deployment to Hugging Face Spaces:

1. Create a new Space on Hugging Face
2. Upload all files to the Space
3. The app will automatically configure itself for the Hugging Face environment

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- Data sources: Johns Hopkins CSSE, Our World in Data
- Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, Gradio
US_engineered_features.csv
ADDED
The diff for this file is too large to render.
See raw diff
covid_country_ts.csv
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1af870f40973ef368b35e81c247ffde415e73db91bf80764412a17d86847474d
size 19035987
covid_full_dataset.csv
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:382205dd481f9868e30e5d3574714061e9e9c492015061934d5bc92176cbed04
size 44431554
gradio_app.py
ADDED
@@ -0,0 +1,363 @@
import gradio as gr
import pandas as pd
import numpy as np
import joblib
from datetime import datetime, timedelta
import os
import matplotlib.pyplot as plt
import gc
from typing import Dict, List, Tuple, Union, Any

# Load the models and scaler
def load_models():
    models = {}
    model_files = [f for f in os.listdir() if f.endswith('_model.pkl')]

    if not model_files:
        print("No trained models found. Please run train_models.py first.")
        return None

    for model_file in model_files:
        model_name = ' '.join([word.capitalize() for word in model_file.replace('_model.pkl', '').split('_')])
        models[model_name] = joblib.load(model_file)

    print(f"Loaded {len(models)} models: {', '.join(models.keys())}")
    return models

# Create a function to get most recent data for prediction
def get_recent_data(data_file='US_engineered_features.csv', rows=30, optimize_memory=True):
    """
    Load and process recent data for prediction

    Parameters:
    -----------
    data_file : str
        Path to the data file
    rows : int
        Number of rows to retrieve from the end of the dataset
    optimize_memory : bool
        Whether to optimize memory usage

    Returns:
    --------
    pd.DataFrame
        Recent data sorted by date
    """
    # For memory optimization, define dtypes for critical columns
    if optimize_memory:
        dtype_dict = {
            'New_Confirmed': 'int32',
            'Deaths': 'int32',
            'Confirmed': 'int32',
            'Country/Region': 'category'
        }

        # Read only necessary columns if file is large
        try:
            # First check the file size
            file_size = os.path.getsize(data_file) / (1024 * 1024)  # Size in MB

            if file_size > 100:  # If file is larger than 100MB
                # Get column list first
                df_cols = pd.read_csv(data_file, nrows=1).columns.tolist()

                # Define essential columns for prediction
                essential_cols = ['Date', 'Country/Region', 'New_Confirmed', 'Deaths', 'Confirmed',
                                  'Recovered', 'New_Deaths', 'New_Recovered',
                                  'population', 'population_density', 'median_age']

                # Filter to columns that exist in the dataset
                cols_to_use = [col for col in essential_cols if col in df_cols]

                # Read only the essential columns for efficiency
                df = pd.read_csv(data_file,
                                 usecols=cols_to_use,
                                 dtype={col: dtype_dict.get(col, None) for col in cols_to_use if col in dtype_dict})
            else:
                df = pd.read_csv(data_file, dtype=dtype_dict)
        except Exception as e:
            print(f"Error optimizing data load: {e}")
            # Fall back to standard loading
            df = pd.read_csv(data_file)
    else:
        df = pd.read_csv(data_file)

    # Convert Date to datetime
    df['Date'] = pd.to_datetime(df['Date'])

    # Sort and get recent data
    df = df.sort_values('Date', ascending=False).head(rows)

    # Create a plot of recent confirmed cases
    plt.figure(figsize=(10, 6))
    plt.plot(df['Date'], df['New_Confirmed'], marker='o')
    plt.title('Recent New Confirmed COVID-19 Cases')
    plt.xlabel('Date')
    plt.ylabel('New Confirmed Cases')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('recent_cases.png')
    plt.close()  # Close to free memory

    return df

# Function to create predictions
def make_prediction(model, data, feature_names, days_to_predict=7, scaler=None):
    """
    Make prediction using the selected model and data

    Parameters:
    -----------
    model : object
        Trained model with predict method
    data : pd.DataFrame
        Data to make prediction on
    feature_names : List[str]
        Names of the features used for prediction
    days_to_predict : int
        Number of days ahead to predict
    scaler : object, optional
        Fitted scaler from preprocessing, applied so inputs match the scaled training data

    Returns:
    --------
    Tuple[float, str]
        Prediction value and prediction date
    """
    # Get the most recent row (copy so the fills below do not modify the original frame)
    recent_data = data.iloc[0:1].copy()

    # Handle missing features - fill with default values if needed
    missing_features = [f for f in feature_names if f not in recent_data.columns]
    if missing_features:
        print(f"Warning: {len(missing_features)} features are missing from the dataset and will be filled with defaults")
        for feat in missing_features:
            # Use a default value of 0 for missing features
            recent_data[feat] = 0

    # Handle NaN values
    for feat in feature_names:
        if feat in recent_data.columns and recent_data[feat].isna().any():
            recent_data[feat] = recent_data[feat].fillna(0)

    # Extract features - make sure to keep only the features the model was trained on
    try:
        features = recent_data[feature_names].values

        # Convert to float32 for memory efficiency and compatibility
        features = features.astype(np.float32)

        # Apply the same scaling used during training (the models were fit on scaled data)
        if scaler is not None:
            features = scaler.transform(features)

        # Make prediction
        prediction = model.predict(features)[0]

        # Get the date for prediction
        prediction_date = recent_data['Date'].iloc[0] + timedelta(days=days_to_predict)

        return prediction, prediction_date.strftime('%Y-%m-%d')

    except Exception as e:
        print(f"Error making prediction: {e}")
        # Return a reasonable fallback
        return 0, (recent_data['Date'].iloc[0] + timedelta(days=days_to_predict)).strftime('%Y-%m-%d')

# Get available datasets
def get_available_datasets():
    """Get list of available datasets in the current directory"""
    datasets = [f for f in os.listdir() if f.endswith('.csv')]
    return datasets

# Gradio interface function
def predict_covid_cases(model_name, dataset_name, prediction_days):
    """
    Make COVID-19 predictions using the selected model and dataset

    Parameters:
    -----------
    model_name : str
        Name of the model to use for prediction
    dataset_name : str
        Name of the dataset to use for prediction
    prediction_days : int
        Number of days ahead to predict

    Returns:
    --------
    Tuple[str, str]
        Prediction results and path to the plot image
    """
    # Load all necessary models and data
    models = load_models()
    if not models:
        return "No trained models available. Please train the models first.", None

    # Load scaler if available
    scaler = None
    if os.path.exists('scaler.pkl'):
        try:
            scaler = joblib.load('scaler.pkl')
        except Exception as e:
            print(f"Warning: Could not load scaler: {e}")

    # Get recent data
    try:
        recent_data = get_recent_data(data_file=dataset_name)
    except Exception as e:
        return f"Error loading data from {dataset_name}: {str(e)}", None

    # Load feature names
    if not os.path.exists('features.txt'):
        return "Features list not found. Please run preprocessing first.", None

    with open('features.txt', 'r') as f:
        feature_names = [line.strip() for line in f.readlines()]

    # Make prediction using the selected model
    try:
        prediction, prediction_date = make_prediction(
            models[model_name],
            recent_data,
            feature_names,
            days_to_predict=int(prediction_days),
            scaler=scaler  # keep inputs on the same scale the models were trained on
        )

        # Create output message
        result = f"## COVID-19 Prediction Results\n\n"
        result += f"### Model: {model_name}\n\n"
        result += f"### Dataset: {dataset_name}\n\n"
        result += f"### Prediction for {prediction_date}:\n"
        result += f"**New confirmed cases: {int(prediction):,}**\n\n"

        # Current cases for comparison
        latest_date = recent_data['Date'].iloc[0].strftime('%Y-%m-%d')
        latest_cases = recent_data['New_Confirmed'].iloc[0]
        result += f"### Latest data ({latest_date}):\n"
        result += f"**New confirmed cases: {int(latest_cases):,}**\n\n"

        # Calculate percent change
        percent_change = ((prediction - latest_cases) / latest_cases) * 100
        change_direction = "increase" if percent_change > 0 else "decrease"
        result += f"### This represents a {abs(percent_change):.2f}% {change_direction} from the latest data.\n"

        # Force garbage collection
        gc.collect()

        # Add the recent cases plot
        if os.path.exists('recent_cases.png'):
            return result, 'recent_cases.png'
        else:
            return result, None

    except Exception as e:
        return f"Error making prediction: {str(e)}", None

# Create and launch the Gradio interface
def create_interface():
    """
    Create the Gradio interface for COVID-19 prediction

    Returns:
    --------
    gr.Blocks
        Gradio interface
    """
    # Load models to get available model names
    models = load_models()
    if not models:
        model_names = ["No models available"]
    else:
        model_names = list(models.keys())

    # Get available datasets
    datasets = get_available_datasets()
    if not datasets:
        dataset_names = ["No datasets available"]
    else:
        dataset_names = datasets

    # Create the interface
    with gr.Blocks(title="COVID-19 Prediction Model") as demo:
        gr.Markdown(
            """
            # COVID-19 Case Prediction

            This application uses regression models to predict future COVID-19 cases.
            Select a model, dataset, and the number of days ahead to predict.
            """
        )

        with gr.Row():
            with gr.Column():
                model_dropdown = gr.Dropdown(
                    choices=model_names,
                    label="Select Model",
                    value=model_names[0] if model_names else None
                )

                dataset_dropdown = gr.Dropdown(
                    choices=dataset_names,
                    label="Select Dataset",
                    value="US_engineered_features.csv" if "US_engineered_features.csv" in dataset_names else (dataset_names[0] if dataset_names else None)
                )

                prediction_days = gr.Slider(
                    minimum=1,
                    maximum=14,
                    value=7,
                    step=1,
                    label="Days to Predict Ahead"
                )

                predict_button = gr.Button("Predict")

            with gr.Column():
                output_text = gr.Markdown("Select a model, dataset, and prediction timeframe, then click 'Predict'")
                output_image = gr.Image(label="Recent Case Trends")

        predict_button.click(
            fn=predict_covid_cases,
            inputs=[model_dropdown, dataset_dropdown, prediction_days],
            outputs=[output_text, output_image]
        )

        gr.Markdown(
            """
            ### About the Models

            - **Random Forest**: A powerful ensemble model that works well with many features
            - **Linear Regression**: A simple but effective baseline model
            - **SVR (Support Vector Regression)**: Good for capturing non-linear relationships
            - **Gradient Boosting**: An ensemble model that builds trees sequentially

            ### Memory Usage

            This application is optimized to handle multiple datasets of different types while minimizing memory usage.
            """
        )

    return demo

# Main function to start the Gradio app
def main():
    """Launch the Gradio app"""
    demo = create_interface()

    # Configure for both local and Hugging Face deployment
    # For Hugging Face deployment, we need to make sure it's servable publicly
    is_huggingface = os.environ.get('SPACE_ID') is not None

    if is_huggingface:
        print("Detected Hugging Face environment, configuring for Space deployment")
        # For HF Spaces, specific configuration
        demo.launch(
            server_name="0.0.0.0",  # Bind to all interfaces
            share=False,  # No need for a sharing link in HF
            favicon_path="https://huggingface.co/favicon.ico"  # Use HF favicon
        )
    else:
        # Local deployment
        print("Configuring for local deployment")
        demo.launch(
            share=True,  # Create a shareable link
            debug=True  # Show more error details
        )

if __name__ == "__main__":
    main()
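
A quick way to exercise the prediction path without the UI, assuming `preprocess_data.py` and `train_models.py` have already produced `features.txt`, `scaler.pkl`, and the `*_model.pkl` files (a minimal sketch, not part of the app itself):

```python
import os
import joblib
from gradio_app import load_models, get_recent_data, make_prediction

models = load_models()
recent = get_recent_data('US_engineered_features.csv')
feature_names = [line.strip() for line in open('features.txt')]
scaler = joblib.load('scaler.pkl') if os.path.exists('scaler.pkl') else None

value, date = make_prediction(models['Random Forest'], recent, feature_names,
                              days_to_predict=7, scaler=scaler)
print(f"Predicted new confirmed cases for {date}: {value:,.0f}")
```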
hf_space.yml
ADDED
@@ -0,0 +1,3 @@
sdk: gradio
sdk_version: 4.10.0
app_file: gradio_app.py
preprocess_data.py
ADDED
@@ -0,0 +1,392 @@
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import os
import gc
from typing import Dict, List, Tuple, Union

class COVIDDataProcessor:
    """
    Class to handle preprocessing of COVID-19 data from multiple datasets
    with memory optimization.
    """
    def __init__(self, data_config: Dict = None):
        """
        Initialize the data processor

        Parameters:
        -----------
        data_config : Dict
            Configuration for loading datasets with column dtypes
        """
        self.data_config = data_config or {}
        self.datasets = {}
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
        self.feature_cols = None
        self.target_col = None
        self.categorical_cols = []
        self.date_cols = ['Date']
        self.numerical_cols = []
        self.scaler = None

    @staticmethod
    def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
        """
        Optimize memory usage by converting columns to appropriate dtypes

        Parameters:
        -----------
        df : pd.DataFrame
            The dataframe to optimize

        Returns:
        --------
        pd.DataFrame
            Memory-optimized dataframe
        """
        # Convert integer columns to the smallest integer type that fits
        for col in df.select_dtypes(include=['int']).columns:
            if df[col].min() >= 0:
                if df[col].max() <= 255:
                    df[col] = df[col].astype(np.uint8)
                elif df[col].max() <= 65535:
                    df[col] = df[col].astype(np.uint16)
                elif df[col].max() <= 4294967295:
                    df[col] = df[col].astype(np.uint32)
                else:
                    df[col] = df[col].astype(np.uint64)
            else:
                if df[col].min() >= -128 and df[col].max() <= 127:
                    df[col] = df[col].astype(np.int8)
                elif df[col].min() >= -32768 and df[col].max() <= 32767:
                    df[col] = df[col].astype(np.int16)
                elif df[col].min() >= -2147483648 and df[col].max() <= 2147483647:
                    df[col] = df[col].astype(np.int32)
                else:
                    df[col] = df[col].astype(np.int64)

        # Convert float columns to float32 (usually sufficient precision)
        for col in df.select_dtypes(include=['float']).columns:
            df[col] = df[col].astype(np.float32)

        # Categorical columns can be converted to 'category' dtype
        for col in df.select_dtypes(include=['object']).columns:
            if col != 'Date' and df[col].nunique() / len(df) < 0.5:  # not a date column and less than 50% unique values
                df[col] = df[col].astype('category')

        return df

    def load_dataset(self, name: str, file_path: str, optimize_memory: bool = True) -> pd.DataFrame:
        """
        Load a dataset from file with memory optimization

        Parameters:
        -----------
        name : str
            Name to identify the dataset
        file_path : str
            Path to the dataset file
        optimize_memory : bool
            Whether to optimize memory usage

        Returns:
        --------
        pd.DataFrame
            The loaded dataset
        """
        print(f"Loading dataset: {name} from {file_path}")

        # Get column dtypes if specified in config
        dtype_dict = self.data_config.get(name, {}).get('dtypes', None)

        # Load with chunk size for large files to avoid memory issues
        if file_path.endswith('.csv'):
            try:
                if dtype_dict:
                    df = pd.read_csv(file_path, dtype=dtype_dict)
                else:
                    # For large files, read in chunks and concatenate
                    chunks = []
                    for chunk in pd.read_csv(file_path, chunksize=100000):
                        if optimize_memory:
                            chunk = self.optimize_dtypes(chunk)
                        chunks.append(chunk)

                    df = pd.concat(chunks, axis=0)
                    del chunks
                    gc.collect()
            except Exception as e:
                print(f"Error loading CSV: {e}")
                return None
        else:
            print(f"Unsupported file format: {file_path}")
            return None

        # Convert date columns
        if 'Date' in df.columns:
            df['Date'] = pd.to_datetime(df['Date'])

        # Optimize memory usage
        if optimize_memory and dtype_dict is None:
            df = self.optimize_dtypes(df)

        # Store dataset
        self.datasets[name] = df

        print(f"Dataset {name} loaded: {df.shape} - Memory usage: {df.memory_usage().sum() / 1024**2:.2f} MB")
        return df

    def merge_datasets(self, datasets: List[str], on: List[str], how: str = 'inner') -> pd.DataFrame:
        """
        Merge multiple datasets

        Parameters:
        -----------
        datasets : List[str]
            List of dataset names to merge
        on : List[str]
            Columns to merge on
        how : str
            Type of merge to perform

        Returns:
        --------
        pd.DataFrame
            Merged dataset
        """
        if not datasets or len(datasets) < 2:
            print("Need at least two datasets to merge")
            return None

        # Start with the first dataset
        result = self.datasets[datasets[0]].copy()

        # Merge with the rest
        for i in range(1, len(datasets)):
            result = result.merge(self.datasets[datasets[i]], on=on, how=how)
            gc.collect()  # Force garbage collection to free memory

        print(f"Merged dataset shape: {result.shape} - Memory usage: {result.memory_usage().sum() / 1024**2:.2f} MB")
        return result

def load_data(file_path='US_engineered_features.csv', optimize_memory=True):
    """
    Load and preprocess the COVID-19 data
    """
    # Load the data
    print(f"Loading data from {file_path}...")

    # For large files, read in chunks
    if optimize_memory:
        chunks = []
        for chunk in pd.read_csv(file_path, chunksize=100000):
            # Optimize dtypes for each chunk:
            # convert integer columns to the smallest integer type that fits
            for col in chunk.select_dtypes(include=['int']).columns:
                if chunk[col].min() >= 0:
                    if chunk[col].max() <= 255:
                        chunk[col] = chunk[col].astype(np.uint8)
                    elif chunk[col].max() <= 65535:
                        chunk[col] = chunk[col].astype(np.uint16)
                    else:
                        chunk[col] = chunk[col].astype(np.uint32)
                else:
                    if chunk[col].min() >= -128 and chunk[col].max() <= 127:
                        chunk[col] = chunk[col].astype(np.int8)
                    elif chunk[col].min() >= -32768 and chunk[col].max() <= 32767:
                        chunk[col] = chunk[col].astype(np.int16)
                    else:
                        chunk[col] = chunk[col].astype(np.int32)

            # Convert float columns to float32
            for col in chunk.select_dtypes(include=['float']).columns:
                chunk[col] = chunk[col].astype(np.float32)

            chunks.append(chunk)

        df = pd.concat(chunks, axis=0)
        del chunks
        gc.collect()  # Force garbage collection
    else:
        df = pd.read_csv(file_path)

    # Convert Date to datetime
    if 'Date' in df.columns:
        df['Date'] = pd.to_datetime(df['Date'])

    # Display basic information
    print(f"Data shape: {df.shape}")
    if 'Date' in df.columns:
        print(f"Time period: {df['Date'].min()} to {df['Date'].max()}")

    return df

def preprocess_data(df, target_column='New_Confirmed', prediction_days=7, test_size=0.2):
    """
    Preprocess the data for regression modeling

    Parameters:
    - df: DataFrame containing the COVID-19 data
    - target_column: The column to predict
    - prediction_days: Number of days ahead to predict
    - test_size: Proportion of data to use for testing

    Returns:
    - X_train, X_test, y_train, y_test: Train and test sets
    - feature_names: Names of the features used for prediction
    - scaler: The fitted scaler for inverse transformations
    """
    # Convert Date to datetime if not already
    if 'Date' in df.columns and not pd.api.types.is_datetime64_any_dtype(df['Date']):
        df['Date'] = pd.to_datetime(df['Date'])

    # Create a shifted target column for prediction
    df[f'{target_column}_future_{prediction_days}d'] = df[target_column].shift(-prediction_days)

    # Drop rows with NaN values (typically the last n rows where future data is not available)
    df = df.dropna(subset=[f'{target_column}_future_{prediction_days}d'])

    # Remove non-numeric columns and columns that would cause data leakage
    non_feature_cols = ['Date', 'Country/Region', f'{target_column}_future_{prediction_days}d']
    leakage_cols = [col for col in df.columns if 'future' in col and col != f'{target_column}_future_{prediction_days}d']

    # For regression models, we'll use all available numeric features
    features = df.columns.difference(non_feature_cols + leakage_cols)

    # Select features and target (copy so the NaN fills below do not trigger chained-assignment warnings)
    X = df[features].copy()
    y = df[f'{target_column}_future_{prediction_days}d']

    # Fill missing values with median for numerical features or mode for categorical
    for col in X.columns:
        if X[col].isna().sum() > 0:
            if np.issubdtype(X[col].dtype, np.number):
                X[col] = X[col].fillna(X[col].median())
            else:
                X[col] = X[col].fillna(X[col].mode()[0])

    # Split data into training and testing sets (no shuffling, to preserve time order)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=False
    )

    # Convert pandas DataFrames to NumPy arrays to save memory
    X_train_np = X_train.values
    X_test_np = X_test.values

    # Scale the features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_np)
    X_test_scaled = scaler.transform(X_test_np)

    # Release memory
    del X_train_np, X_test_np
    gc.collect()

    print(f"Training set shape: {X_train.shape}")
    print(f"Testing set shape: {X_test.shape}")
    print(f"Features used: {len(features)}")

    return X_train_scaled, X_test_scaled, y_train, y_test, list(features), scaler

def main():
    """
    Main function to demonstrate data preprocessing
    """
    # Define data configuration for memory optimization
    data_config = {
        'covid_full': {
            'dtypes': {
                'Country/Region': 'category',
                'Confirmed': 'int32',
                'Deaths': 'int32',
                'Recovered': 'int32',
                'New_Confirmed': 'int32',
                'New_Deaths': 'int32',
                'New_Recovered': 'int32',
            }
        },
        'us_engineered': {
            'dtypes': {
                'Country/Region': 'category',
                'Confirmed': 'int32',
                'Deaths': 'int32',
                'New_Confirmed': 'int32',
                'New_Deaths': 'int32',
                'Day': 'int8',
                'Day_of_week': 'int8',
                'Month': 'int8',
                'Year': 'int16',
            }
        }
    }

    # Choose the approach:
    # 1. Basic approach (backward compatible)
    # 2. Advanced multi-dataset approach
    approach = 2  # Change to 1 for the basic approach

    if approach == 1:
        # Basic approach (legacy code)
        print("Using basic approach...")
        df = load_data(optimize_memory=True)
        X_train, X_test, y_train, y_test, features, scaler = preprocess_data(df)
    else:
        # Advanced multi-dataset approach
        print("Using advanced multi-dataset approach...")
        processor = COVIDDataProcessor(data_config)

        # Load datasets
        processor.load_dataset('covid_full', 'covid_full_dataset.csv')
        processor.load_dataset('us_engineered', 'US_engineered_features.csv')

        # Example: You can also load other datasets and merge them
        # processor.load_dataset('raw_confirmed', 'raw_confirmed.csv')
        # processor.load_dataset('raw_deaths', 'raw_deaths.csv')
        # processor.merge_datasets(['raw_confirmed', 'raw_deaths'], on=['Date', 'Country/Region'])

        # For simplicity, we'll just use the US engineered features dataset,
        # but you could use any merged or single dataset here
        X_train, X_test, y_train, y_test, features, scaler = preprocess_data(
            processor.datasets['us_engineered']
        )

    print("\nPreprocessing complete!")
    print(f"Number of training samples: {len(X_train)}")
    print(f"Number of testing samples: {len(X_test)}")
    print(f"Target range: {y_train.min()} to {y_train.max()}")

    # Save the preprocessed data - using numpy's compressed format to save space
    np.savez_compressed('X_train.npz', data=X_train)
    np.savez_compressed('X_test.npz', data=X_test)
    np.savez_compressed('y_train.npz', data=y_train)
    np.savez_compressed('y_test.npz', data=y_test)

    # Also save as .npy for backward compatibility
    np.save('X_train.npy', X_train)
    np.save('X_test.npy', X_test)
    np.save('y_train.npy', y_train if isinstance(y_train, np.ndarray) else y_train.values)
    np.save('y_test.npy', y_test if isinstance(y_test, np.ndarray) else y_test.values)

    # Save features list
    with open('features.txt', 'w') as f:
        for feature in features:
            f.write(f"{feature}\n")

    # Save scaler
    import joblib
    joblib.dump(scaler, 'scaler.pkl')

    print("Preprocessed data saved!")

    # Print memory usage stats
    import psutil
    process = psutil.Process(os.getpid())
    print(f"Current memory usage: {process.memory_info().rss / (1024 * 1024):.2f} MB")

if __name__ == "__main__":
    main()
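
The preprocessing step writes both compressed `.npz` and plain `.npy` copies of the train/test arrays; the sketch below shows how a downstream script can read either format (`train_models.py` uses the `.npy` files). Note that `preprocess_data.py` and `train_models.py` also import `psutil`, which is not listed in requirements.txt and needs to be installed separately.

```python
import numpy as np

# Plain .npy copies (what train_models.py loads)
X_train = np.load('X_train.npy')
y_train = np.load('y_train.npy')

# Equivalent compressed .npz copies, stored under the key 'data'
with np.load('X_train.npz') as archive:
    X_train_npz = archive['data']

assert X_train.shape == X_train_npz.shape
```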
raw_confirmed.csv
ADDED
The diff for this file is too large to render.
See raw diff
raw_deaths.csv
ADDED
The diff for this file is too large to render.
See raw diff
raw_owid.csv
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b2c25df2d17be38533d88d9b714d80050f615519ab39f1ef5831ae38db7dff46
size 107886083
raw_recovered.csv
ADDED
The diff for this file is too large to render.
See raw diff
requirements.txt
ADDED
@@ -0,0 +1,10 @@
pandas==2.1.0
numpy==1.26.0
matplotlib==3.8.0
seaborn==0.13.0
scikit-learn==1.3.0
xgboost==2.0.0
lightgbm==4.1.0
shap==0.43.0
gradio==4.10.0
joblib==1.3.2
run_pipeline.py
ADDED
@@ -0,0 +1,93 @@
#!/usr/bin/env python3
"""
COVID-19 Prediction Pipeline
----------------------------
This script runs the complete pipeline from data preprocessing to model training
and launches the Gradio UI.
"""

import os
import argparse
import subprocess
import time

def clear_screen():
    """Clear the terminal screen"""
    os.system('cls' if os.name == 'nt' else 'clear')

def run_command(command, description):
    """Run a system command and print output"""
    print(f"\n{'=' * 80}")
    print(f"STEP: {description}")
    print(f"{'=' * 80}\n")
    print(f"Running: {command}\n")

    # Run the command
    start_time = time.time()
    result = subprocess.run(command, shell=True)
    end_time = time.time()

    # Check if the command was successful
    if result.returncode == 0:
        print(f"\nSuccess! Completed in {end_time - start_time:.2f} seconds")
    else:
        print(f"\nError! Command failed with exit code {result.returncode}")
        exit(1)

    print(f"\n{'=' * 80}\n")
    time.sleep(1)

def parse_args():
    """Parse command-line arguments"""
    parser = argparse.ArgumentParser(description="COVID-19 Prediction Pipeline")
    parser.add_argument("--skip-preprocessing", action="store_true", help="Skip data preprocessing")
    parser.add_argument("--skip-training", action="store_true", help="Skip model training")
    parser.add_argument("--only-ui", action="store_true", help="Only launch the Gradio UI")
    return parser.parse_args()

def main():
    """Run the complete pipeline"""
    args = parse_args()

    # Display welcome banner
    clear_screen()
    print("\n" + "=" * 80)
    print("COVID-19 PREDICTION PIPELINE".center(80))
    print("=" * 80 + "\n")
    print("This script will run the complete pipeline:")
    print("1. Data preprocessing")
    print("2. Model training")
    print("3. Launch Gradio UI for predictions")
    print("\nPress Ctrl+C at any time to stop the pipeline.")
    print()

    try:
        if args.only_ui:
            print("Skipping preprocessing and training, launching UI only...")
        else:
            # Step 1: Data Preprocessing
            if not args.skip_preprocessing:
                run_command("python preprocess_data.py", "Data Preprocessing")
            else:
                print("Skipping preprocessing as requested.")

            # Step 2: Model Training
            if not args.skip_training:
                run_command("python train_models.py", "Model Training")
            else:
                print("Skipping model training as requested.")

        # Step 3: Launch Gradio UI
        print("\nLaunching Gradio UI for predictions...")
        run_command("python gradio_app.py", "Gradio UI Launch")

    except KeyboardInterrupt:
        print("\n\nPipeline interrupted by user. Exiting.")
        exit(0)

    except Exception as e:
        print(f"\n\nError in pipeline: {str(e)}")
        exit(1)

if __name__ == "__main__":
    main()
train_models.py
ADDED
@@ -0,0 +1,165 @@
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import os
import gc
import psutil
from typing import Dict, List, Tuple

# Set the style for plots
sns.set(style="whitegrid")

# Set up memory monitoring
def print_memory_usage():
    process = psutil.Process(os.getpid())
    memory_usage = process.memory_info().rss / (1024 * 1024)  # Convert to MB
    print(f"Current memory usage: {memory_usage:.2f} MB")

def train_and_evaluate_models(X_train, X_test, y_train, y_test, feature_names=None):
    """
    Train and evaluate multiple regression models for COVID-19 prediction

    Parameters:
    - X_train, X_test, y_train, y_test: Training and testing data
    - feature_names: List of feature names (for feature importance)

    Returns:
    - models: Dictionary of trained models
    - metrics: DataFrame of evaluation metrics for each model
    """
    models = {
        'Linear Regression': LinearRegression(),
        'Support Vector Regression': SVR(kernel='rbf', gamma='scale', C=1.0, epsilon=0.1),
        'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
        'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
    }

    metrics = {
        'Model': [],
        'RMSE': [],
        'MAE': [],
        'R²': []
    }

    for name, model in models.items():
        print(f"Training {name}...")
        model.fit(X_train, y_train)

        # Predict
        y_pred = model.predict(X_test)

        # Calculate metrics
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)

        # Store metrics
        metrics['Model'].append(name)
        metrics['RMSE'].append(rmse)
        metrics['MAE'].append(mae)
        metrics['R²'].append(r2)

        print(f"{name} - RMSE: {rmse:.2f}, MAE: {mae:.2f}, R²: {r2:.4f}")

        # Save the model
        joblib.dump(model, f'{name.replace(" ", "_").lower()}_model.pkl')

        # Plot actual vs predicted
        plt.figure(figsize=(10, 6))
        plt.scatter(y_test, y_pred, alpha=0.5)
        plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
        plt.title(f'{name} - Actual vs Predicted')
        plt.xlabel('Actual')
        plt.ylabel('Predicted')
        plt.savefig(f'{name.replace(" ", "_").lower()}_predictions.png')
        plt.close()  # Close the figure to free memory

        # If it's Random Forest or Gradient Boosting, plot feature importance
        if name in ['Random Forest', 'Gradient Boosting'] and feature_names is not None:
            plt.figure(figsize=(12, 8))
            feature_importance = model.feature_importances_
            sorted_idx = np.argsort(feature_importance)

            # Select the top 15 features for better visualization
            top_k = min(15, len(feature_importance))
            plt.barh(range(top_k), feature_importance[sorted_idx][-top_k:])
            plt.yticks(range(top_k), [feature_names[i] for i in sorted_idx[-top_k:]])
            plt.title(f'{name} - Top {top_k} Feature Importance')
            plt.tight_layout()
            plt.savefig(f'{name.replace(" ", "_").lower()}_feature_importance.png')
            plt.close()

    # Plot comparison of models
    metrics_df = pd.DataFrame(metrics)

    # Create bar plot for RMSE and MAE
    plt.figure(figsize=(12, 6))

    bar_width = 0.35
    index = np.arange(len(metrics_df['Model']))

    plt.bar(index, metrics_df['RMSE'], bar_width, label='RMSE')
    plt.bar(index + bar_width, metrics_df['MAE'], bar_width, label='MAE')

    plt.xlabel('Model')
    plt.ylabel('Error')
    plt.title('Model Comparison - RMSE and MAE')
    plt.xticks(index + bar_width / 2, metrics_df['Model'], rotation=45)
    plt.legend()
    plt.tight_layout()
    plt.savefig('model_comparison_error.png')

    # Create bar plot for R²
    plt.figure(figsize=(12, 6))
    plt.bar(metrics_df['Model'], metrics_df['R²'], color='skyblue')
    plt.xlabel('Model')
    plt.ylabel('R²')
    plt.title('Model Comparison - R²')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('model_comparison_r2.png')

    print("\nModel training and evaluation complete!")
    # Build the file list first to avoid nesting identical quote characters inside the f-string
    saved_files = [f'{name.replace(" ", "_").lower()}_model.pkl' for name in models.keys()]
    print(f"Models saved as: {', '.join(saved_files)}")

    return models, metrics_df

def main():
    """
    Main function to train and evaluate models
    """
    # Check if preprocessed data exists
    if not all(os.path.exists(f) for f in ['X_train.npy', 'X_test.npy', 'y_train.npy', 'y_test.npy']):
        print("Preprocessed data not found. Please run preprocess_data.py first.")
        return

    # Load preprocessed data
    X_train = np.load('X_train.npy')
    X_test = np.load('X_test.npy')
    y_train = np.load('y_train.npy')
    y_test = np.load('y_test.npy')

    # Load feature names
    feature_names = []
    if os.path.exists('features.txt'):
        with open('features.txt', 'r') as f:
            feature_names = [line.strip() for line in f.readlines()]

    print("Data loaded successfully!")
    print(f"Training data shape: {X_train.shape}")
    print(f"Testing data shape: {X_test.shape}")

    # Train and evaluate models
    models, metrics = train_and_evaluate_models(X_train, X_test, y_train, y_test, feature_names)

    # Display and save comparison table
    print("\nModel Comparison:")
    print(metrics)
    metrics.to_csv('model_comparison.csv', index=False)

if __name__ == "__main__":
    main()
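
Once training has finished, any of the saved models can be reloaded and re-scored on the held-out test split without retraining (a minimal sketch using the artifacts written by the scripts above):

```python
import joblib
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

model = joblib.load('random_forest_model.pkl')
X_test = np.load('X_test.npy')
y_test = np.load('y_test.npy')

y_pred = model.predict(X_test)
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"R²: {r2_score(y_test, y_pred):.4f}")
```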