vishwak1 committed on
Commit
fb61aba
·
verified ·
1 Parent(s): f0dd4ab

Upload 14 files

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ covid_country_ts.csv filter=lfs diff=lfs merge=lfs -text
37
+ covid_full_dataset.csv filter=lfs diff=lfs merge=lfs -text
38
+ raw_owid.csv filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,13 +1,119 @@
1
- ---
2
- title: Disease Prediction
3
- emoji: 🌖
4
- colorFrom: blue
5
- colorTo: red
6
- sdk: gradio
7
- sdk_version: 5.31.0
8
- app_file: app.py
9
- pinned: false
10
- short_description: project
11
- ---
12
-
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # COVID-19 Prediction Model
2
+
3
+ This project implements a COVID-19 case prediction system built on four regression models, with Random Forest as the primary model. It includes a Gradio user interface for deployment to Hugging Face Spaces.
4
+
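+ At its core, the pipeline frames forecasting as supervised regression: each row's engineered features are paired with the case count `prediction_days` ahead, which becomes the regression target. A condensed sketch of how `preprocess_data.py` builds that target (column names follow the project's datasets):
+
+ ```
+ import pandas as pd
+
+ df = pd.read_csv('US_engineered_features.csv', parse_dates=['Date'])
+
+ # Pair today's features with the case count 7 days ahead (the value the models learn to predict)
+ prediction_days = 7
+ df[f'New_Confirmed_future_{prediction_days}d'] = df['New_Confirmed'].shift(-prediction_days)
+ df = df.dropna(subset=[f'New_Confirmed_future_{prediction_days}d'])
+ ```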
5
+ ## Features
6
+
7
+ - **Memory-optimized** data processing that handles multiple large datasets by downcasting numeric columns and storing low-cardinality text columns as categoricals
8
+ - **Multiple regression models** for comparison:
9
+ - Random Forest Regression
10
+ - Linear Regression
11
+ - Support Vector Regression (SVR)
12
+ - Gradient Boosting Regression
13
+ - **Gradio UI** for easy model selection, visualization, and deployment to Hugging Face Spaces
14
+ - Complete **data preprocessing pipeline** with feature engineering
15
+ - **Performance evaluation** metrics and visualization
16
+
17
+ ## Project Structure
18
+
19
+ ```
20
+ COVID-19-Prediction/
21
+ ├── covid_full_dataset.csv # Complete COVID-19 dataset
22
+ ├── US_engineered_features.csv # Engineered features for US data
23
+ ├── raw_confirmed.csv # Raw confirmed cases data
24
+ ├── raw_deaths.csv # Raw deaths data
25
+ ├── raw_recovered.csv # Raw recovered cases data
26
+ ├── raw_owid.csv # Additional data from Our World in Data
27
+ ├── covid_country_ts.csv # Country-level time series data
28
+ ├── preprocess_data.py # Data preprocessing script
29
+ ├── train_models.py # Model training script
30
+ ├── gradio_app.py # Gradio UI for predictions
31
+ ├── run_pipeline.py # Complete pipeline runner
32
+ └── requirements.txt # Project dependencies
33
+ ```
34
+
35
+ ## Installation
36
+
37
+ 1. Clone this repository:
38
+ ```
39
+ git clone https://github.com/yourusername/covid19-prediction.git
40
+ cd covid19-prediction
41
+ ```
42
+
43
+ 2. Install the required packages:
44
+ ```
45
+ pip install -r requirements.txt
46
+ ```
47
+
48
+ ## Usage
49
+
50
+ ### Run the Complete Pipeline
51
+
52
+ To run the complete pipeline (preprocessing, training, and UI):
53
+
54
+ ```
55
+ python run_pipeline.py
56
+ ```
57
+
58
+ ### Pipeline Options
59
+
60
+ - Skip preprocessing: `python run_pipeline.py --skip-preprocessing`
61
+ - Skip training: `python run_pipeline.py --skip-training`
62
+ - Only launch UI: `python run_pipeline.py --only-ui`
63
+
64
+ ### Run Individual Steps
65
+
66
+ 1. **Data Preprocessing**:
67
+ ```
68
+ python preprocess_data.py
69
+ ```
70
+
71
+ 2. **Model Training**:
72
+ ```
73
+ python train_models.py
74
+ ```
75
+
76
+ 3. **Launch Gradio UI**:
77
+ ```
78
+ python gradio_app.py
79
+ ```
80
+
81
+ ## Memory Optimization
82
+
83
+ This project is optimized to handle large datasets efficiently (a short sketch follows this list):
84
+
85
+ - Uses appropriate data types to minimize memory footprint
86
+ - Processes data in chunks for large files
87
+ - Employs garbage collection to free memory
88
+ - Uses compressed NumPy formats for storing processed data
89
+ - Optimizes model parameters for memory efficiency
90
+
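+ A minimal sketch of the dtype downcasting and chunked loading used in `preprocess_data.py` (the chunk size and 50% cardinality threshold mirror the script; the file name is one of the project datasets):
+
+ ```
+ import numpy as np
+ import pandas as pd
+
+ def downcast(df: pd.DataFrame) -> pd.DataFrame:
+     # Shrink numeric columns to the smallest safe dtype
+     for col in df.select_dtypes(include=['int']).columns:
+         kind = 'unsigned' if df[col].min() >= 0 else 'integer'
+         df[col] = pd.to_numeric(df[col], downcast=kind)
+     for col in df.select_dtypes(include=['float']).columns:
+         df[col] = df[col].astype(np.float32)
+     # Low-cardinality text columns become pandas categoricals
+     for col in df.select_dtypes(include=['object']).columns:
+         if col != 'Date' and df[col].nunique() / len(df) < 0.5:
+             df[col] = df[col].astype('category')
+     return df
+
+ # Read a large CSV in chunks, downcasting each chunk before concatenating
+ chunks = [downcast(chunk) for chunk in pd.read_csv('covid_full_dataset.csv', chunksize=100_000)]
+ df = pd.concat(chunks, axis=0)
+ ```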
91
+ ## Models
92
+
93
+ The project implements and compares four regression models (their hyperparameters are shown after this list):
94
+
95
+ 1. **Random Forest Regressor**: An ensemble learning method that builds multiple decision trees and merges their predictions.
96
+ 2. **Linear Regression**: A simple baseline model that assumes a linear relationship between features and target.
97
+ 3. **Support Vector Regression (SVR)**: Uses support vectors to create a regression model that can capture non-linear relationships.
98
+ 4. **Gradient Boosting Regressor**: An ensemble technique that builds trees sequentially, with each tree correcting errors made by previous ones.
99
+
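+ The model configurations used in `train_models.py`:
+
+ ```
+ from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
+ from sklearn.linear_model import LinearRegression
+ from sklearn.svm import SVR
+
+ models = {
+     'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
+     'Linear Regression': LinearRegression(),
+     'Support Vector Regression': SVR(kernel='rbf', gamma='scale', C=1.0, epsilon=0.1),
+     'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
+                                                    max_depth=3, random_state=42),
+ }
+ ```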
100
+ ## Hugging Face Deployment
101
+
102
+ The Gradio interface is configured for easy deployment to Hugging Face Spaces:
103
+
104
+ 1. Create a new Space on Hugging Face
105
+ 2. Upload all files to the Space
106
+ 3. The app detects the Hugging Face environment automatically (via the `SPACE_ID` environment variable) and configures itself for the Space, as sketched below
107
+
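+ A condensed sketch of that environment check in `gradio_app.py` (the `Blocks` body here stands in for the full interface built by `create_interface()`):
+
+ ```
+ import os
+ import gradio as gr
+
+ with gr.Blocks() as demo:  # stand-in for create_interface()
+     gr.Markdown("COVID-19 Case Prediction")
+
+ # Hugging Face sets SPACE_ID inside every Space, so its presence doubles as an environment check
+ if os.environ.get('SPACE_ID') is not None:
+     demo.launch(server_name="0.0.0.0", share=False)  # Space deployment
+ else:
+     demo.launch(share=True, debug=True)  # local run with a shareable link
+ ```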
108
+ ## Contributing
109
+
110
+ Contributions are welcome! Please feel free to submit a Pull Request.
111
+
112
+ ## License
113
+
114
+ This project is licensed under the MIT License - see the LICENSE file for details.
115
+
116
+ ## Acknowledgments
117
+
118
+ - Data sources: Johns Hopkins CSSE, Our World in Data
119
+ - Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, Gradio
US_engineered_features.csv ADDED
The diff for this file is too large to render. See raw diff
 
covid_country_ts.csv ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1af870f40973ef368b35e81c247ffde415e73db91bf80764412a17d86847474d
3
+ size 19035987
covid_full_dataset.csv ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:382205dd481f9868e30e5d3574714061e9e9c492015061934d5bc92176cbed04
3
+ size 44431554
gradio_app.py ADDED
@@ -0,0 +1,363 @@
1
+ import gradio as gr
2
+ import pandas as pd
3
+ import numpy as np
4
+ import joblib
5
+ from datetime import datetime, timedelta
6
+ import os
7
+ import matplotlib.pyplot as plt
8
+ import gc
9
+ from typing import Dict, List, Tuple, Union, Any
10
+
11
+ # Load the models and scaler
12
+ def load_models():
13
+ models = {}
14
+ model_files = [f for f in os.listdir() if f.endswith('_model.pkl')]
15
+
16
+ if not model_files:
17
+ print("No trained models found. Please run train_models.py first.")
18
+ return None
19
+
20
+ for model_file in model_files:
21
+ model_name = ' '.join([word.capitalize() for word in model_file.replace('_model.pkl', '').split('_')])
22
+ models[model_name] = joblib.load(model_file)
23
+
24
+ print(f"Loaded {len(models)} models: {', '.join(models.keys())}")
25
+ return models
26
+
27
+ # Create a function to get most recent data for prediction
28
+ def get_recent_data(data_file='US_engineered_features.csv', rows=30, optimize_memory=True):
29
+ """
30
+ Load and process recent data for prediction
31
+
32
+ Parameters:
33
+ -----------
34
+ data_file : str
35
+ Path to the data file
36
+ rows : int
37
+ Number of rows to retrieve from the end of the dataset
38
+ optimize_memory : bool
39
+ Whether to optimize memory usage
40
+
41
+ Returns:
42
+ --------
43
+ pd.DataFrame
44
+ The most recent rows, sorted newest-first by date
45
+ """
46
+ # For memory optimization, define dtypes for critical columns
47
+ if optimize_memory:
48
+ dtype_dict = {
49
+ 'New_Confirmed': 'int32',
50
+ 'Deaths': 'int32',
51
+ 'Confirmed': 'int32',
52
+ 'Country/Region': 'category'
53
+ }
54
+
55
+ # Read only necessary columns if file is large
56
+ try:
57
+ # First check the file size
58
+ file_size = os.path.getsize(data_file) / (1024 * 1024) # Size in MB
59
+
60
+ if file_size > 100: # If file is larger than 100MB
61
+ # Get column list first
62
+ df_cols = pd.read_csv(data_file, nrows=1).columns.tolist()
63
+
64
+ # Define essential columns for prediction
65
+ essential_cols = ['Date', 'Country/Region', 'New_Confirmed', 'Deaths', 'Confirmed',
66
+ 'Recovered', 'New_Deaths', 'New_Recovered',
67
+ 'population', 'population_density', 'median_age']
68
+
69
+ # Filter to columns that exist in the dataset
70
+ cols_to_use = [col for col in essential_cols if col in df_cols]
71
+
72
+ # Read only the essential columns to keep memory usage low
73
+ df = pd.read_csv(data_file,
74
+ usecols=cols_to_use,
75
+ dtype={col: dtype_dict.get(col, None) for col in cols_to_use if col in dtype_dict})
76
+ else:
77
+ df = pd.read_csv(data_file, dtype=dtype_dict)
78
+ except Exception as e:
79
+ print(f"Error optimizing data load: {e}")
80
+ # Fall back to standard loading
81
+ df = pd.read_csv(data_file)
82
+ else:
83
+ df = pd.read_csv(data_file)
84
+
85
+ # Convert Date to datetime
86
+ df['Date'] = pd.to_datetime(df['Date'])
87
+
88
+ # Sort and get recent data
89
+ df = df.sort_values('Date', ascending=False).head(rows)
90
+
91
+ # Create a plot of recent confirmed cases
92
+ plt.figure(figsize=(10, 6))
93
+ plt.plot(df['Date'], df['New_Confirmed'], marker='o')
94
+ plt.title('Recent New Confirmed COVID-19 Cases')
95
+ plt.xlabel('Date')
96
+ plt.ylabel('New Confirmed Cases')
97
+ plt.xticks(rotation=45)
98
+ plt.tight_layout()
99
+ plt.savefig('recent_cases.png')
100
+ plt.close() # Close to free memory
101
+
102
+ return df
103
+
104
+ # Function to create predictions
105
+ def make_prediction(model, data, feature_names, days_to_predict=7, scaler=None):
106
+ """
107
+ Make prediction using the selected model and data
108
+
109
+ Parameters:
110
+ -----------
111
+ model : object
112
+ Trained model with predict method
113
+ data : pd.DataFrame
114
+ Data to make prediction on
115
+ feature_names : List[str]
116
+ Names of the features used for prediction
117
+ days_to_predict : int
118
+ Number of days ahead to predict
119
+
120
+ Returns:
121
+ --------
122
+ Tuple[float, str]
123
+ Prediction value and prediction date
124
+ """
125
+ # Get the most recent row
126
+ recent_data = data.iloc[0:1].copy()  # copy so the fill-ins below do not modify the caller's frame
127
+
128
+ # Handle missing features - fill with median/mode values if needed
129
+ missing_features = [f for f in feature_names if f not in recent_data.columns]
130
+ if missing_features:
131
+ print(f"Warning: {len(missing_features)} features are missing from the dataset and will be filled with defaults")
132
+ for feat in missing_features:
133
+ # Use a default value of 0 for missing features
134
+ recent_data[feat] = 0
135
+
136
+ # Handle NaN values
137
+ for feat in feature_names:
138
+ if feat in recent_data.columns and recent_data[feat].isna().any():
139
+ recent_data[feat] = recent_data[feat].fillna(0)
140
+
141
+ # Extract features - make sure to keep only the features the model was trained on
142
+ try:
143
+ features = recent_data[feature_names].values
144
+
145
+ # Convert to float32 for memory efficiency and compatibility
146
+ features = features.astype(np.float32)
+
+ # Apply the same scaling used during training, if a scaler was provided
+ if scaler is not None:
+ features = scaler.transform(features)
147
+
148
+ # Make prediction
149
+ prediction = model.predict(features)[0]
150
+
151
+ # Get the date for prediction
152
+ prediction_date = recent_data['Date'].iloc[0] + timedelta(days=days_to_predict)
153
+
154
+ return prediction, prediction_date.strftime('%Y-%m-%d')
155
+
156
+ except Exception as e:
157
+ print(f"Error making prediction: {e}")
158
+ # Return a reasonable fallback
159
+ return 0, (recent_data['Date'].iloc[0] + timedelta(days=days_to_predict)).strftime('%Y-%m-%d')
160
+
161
+ # Get available datasets
162
+ def get_available_datasets():
163
+ """Get list of available datasets in the current directory"""
164
+ datasets = [f for f in os.listdir() if f.endswith('.csv')]
165
+ return datasets
166
+
167
+ # Gradio interface function
168
+ def predict_covid_cases(model_name, dataset_name, prediction_days):
169
+ """
170
+ Make COVID-19 predictions using the selected model and dataset
171
+
172
+ Parameters:
173
+ -----------
174
+ model_name : str
175
+ Name of the model to use for prediction
176
+ dataset_name : str
177
+ Name of the dataset to use for prediction
178
+ prediction_days : int
179
+ Number of days ahead to predict
180
+
181
+ Returns:
182
+ --------
183
+ Tuple[str, str]
184
+ Prediction results and path to the plot image
185
+ """
186
+ # Load all necessary models and data
187
+ models = load_models()
188
+ if not models:
189
+ return "No trained models available. Please train the models first.", None
190
+
191
+ # Load scaler if available
192
+ scaler = None
193
+ if os.path.exists('scaler.pkl'):
194
+ try:
195
+ scaler = joblib.load('scaler.pkl')
196
+ except Exception as e:
197
+ print(f"Warning: Could not load scaler: {e}")
198
+
199
+ # Get recent data
200
+ try:
201
+ recent_data = get_recent_data(data_file=dataset_name)
202
+ except Exception as e:
203
+ return f"Error loading data from {dataset_name}: {str(e)}", None
204
+
205
+ # Load feature names
206
+ if not os.path.exists('features.txt'):
207
+ return "Features list not found. Please run preprocessing first.", None
208
+
209
+ with open('features.txt', 'r') as f:
210
+ feature_names = [line.strip() for line in f.readlines()]
211
+
212
+ # Make prediction using the selected model
213
+ try:
214
+ prediction, prediction_date = make_prediction(
215
+ models[model_name],
216
+ recent_data,
217
+ feature_names,
218
+ days_to_predict=int(prediction_days),
+ scaler=scaler  # reuse the training-time feature scaling
219
+ )
220
+
221
+ # Create output message
222
+ result = f"## COVID-19 Prediction Results\n\n"
223
+ result += f"### Model: {model_name}\n\n"
224
+ result += f"### Dataset: {dataset_name}\n\n"
225
+ result += f"### Prediction for {prediction_date}:\n"
226
+ result += f"**New confirmed cases: {int(prediction):,}**\n\n"
227
+
228
+ # Current cases for comparison
229
+ latest_date = recent_data['Date'].iloc[0].strftime('%Y-%m-%d')
230
+ latest_cases = recent_data['New_Confirmed'].iloc[0]
231
+ result += f"### Latest data ({latest_date}):\n"
232
+ result += f"**New confirmed cases: {int(latest_cases):,}**\n\n"
233
+
234
+ # Calculate percent change
235
+ # Guard against division by zero when the latest reported count is 0
+ percent_change = ((prediction - latest_cases) / latest_cases) * 100 if latest_cases else 0.0
236
+ change_direction = "increase" if percent_change > 0 else "decrease"
237
+ result += f"### This represents a {abs(percent_change):.2f}% {change_direction} from the latest data.\n"
238
+
239
+ # Force garbage collection
240
+ gc.collect()
241
+
242
+ # Add the recent cases plot
243
+ if os.path.exists('recent_cases.png'):
244
+ return result, 'recent_cases.png'
245
+ else:
246
+ return result, None
247
+
248
+ except Exception as e:
249
+ return f"Error making prediction: {str(e)}", None
250
+
251
+ # Create and launch the Gradio interface
252
+ def create_interface():
253
+ """
254
+ Create the Gradio interface for COVID-19 prediction
255
+
256
+ Returns:
257
+ --------
258
+ gr.Blocks
259
+ Gradio interface
260
+ """
261
+ # Load models to get available model names
262
+ models = load_models()
263
+ if not models:
264
+ model_names = ["No models available"]
265
+ else:
266
+ model_names = list(models.keys())
267
+
268
+ # Get available datasets
269
+ datasets = get_available_datasets()
270
+ if not datasets:
271
+ dataset_names = ["No datasets available"]
272
+ else:
273
+ dataset_names = datasets
274
+
275
+ # Create the interface
276
+ with gr.Blocks(title="COVID-19 Prediction Model") as demo:
277
+ gr.Markdown(
278
+ """
279
+ # COVID-19 Case Prediction
280
+
281
+ This application uses regression models to predict future COVID-19 cases.
282
+ Select a model, dataset, and the number of days ahead to predict.
283
+ """
284
+ )
285
+
286
+ with gr.Row():
287
+ with gr.Column():
288
+ model_dropdown = gr.Dropdown(
289
+ choices=model_names,
290
+ label="Select Model",
291
+ value=model_names[0] if model_names else None
292
+ )
293
+
294
+ dataset_dropdown = gr.Dropdown(
295
+ choices=dataset_names,
296
+ label="Select Dataset",
297
+ value="US_engineered_features.csv" if "US_engineered_features.csv" in dataset_names else (dataset_names[0] if dataset_names else None)
298
+ )
299
+
300
+ prediction_days = gr.Slider(
301
+ minimum=1,
302
+ maximum=14,
303
+ value=7,
304
+ step=1,
305
+ label="Days to Predict Ahead"
306
+ )
307
+
308
+ predict_button = gr.Button("Predict")
309
+
310
+ with gr.Column():
311
+ output_text = gr.Markdown("Select a model, dataset, and prediction timeframe, then click 'Predict'")
312
+ output_image = gr.Image(label="Recent Case Trends")
313
+
314
+ predict_button.click(
315
+ fn=predict_covid_cases,
316
+ inputs=[model_dropdown, dataset_dropdown, prediction_days],
317
+ outputs=[output_text, output_image]
318
+ )
319
+
320
+ gr.Markdown(
321
+ """
322
+ ### About the Models
323
+
324
+ - **Random Forest**: A powerful ensemble model that works well with many features
325
+ - **Linear Regression**: A simple but effective baseline model
326
+ - **SVR (Support Vector Regression)**: Good for capturing non-linear relationships
327
+ - **Gradient Boosting**: An ensemble model that builds trees sequentially
328
+
329
+ ### Memory Usage
330
+
331
+ This application is optimized to handle multiple datasets of different types while minimizing memory usage.
332
+ """
333
+ )
334
+
335
+ return demo
336
+
337
+ # Main function to start the Gradio app
338
+ def main():
339
+ """Launch the Gradio app"""
340
+ demo = create_interface()
341
+
342
+ # Configure for both local and Hugging Face deployment
343
+ # For Hugging Face deployment, we need to make sure it's servable publicly
344
+ is_huggingface = os.environ.get('SPACE_ID') is not None
345
+
346
+ if is_huggingface:
347
+ print("Detected Hugging Face environment, configuring for Space deployment")
348
+ # For HF Spaces, specific configuration
349
+ demo.launch(
350
+ server_name="0.0.0.0", # Bind to all interfaces
351
+ share=False, # No need for sharing link in HF
352
+ favicon_path="https://huggingface.co/favicon.ico" # Use HF favicon
353
+ )
354
+ else:
355
+ # Local deployment
356
+ print("Configuring for local deployment")
357
+ demo.launch(
358
+ share=True, # Create a shareable link
359
+ debug=True # Show more error details
360
+ )
361
+
362
+ if __name__ == "__main__":
363
+ main()
hf_space.yml ADDED
@@ -0,0 +1,3 @@
1
+ sdk: gradio
2
+ sdk_version: 4.10.0
3
+ app_file: gradio_app.py
preprocess_data.py ADDED
@@ -0,0 +1,392 @@
1
+ import pandas as pd
2
+ import numpy as np
3
+ from sklearn.preprocessing import StandardScaler
4
+ from sklearn.model_selection import train_test_split
5
+ import os
6
+ import gc
7
+ from typing import Dict, List, Tuple, Union
8
+
9
+ class COVIDDataProcessor:
10
+ """
11
+ Class to handle preprocessing of COVID-19 data from multiple datasets
12
+ with memory optimization.
13
+ """
14
+ def __init__(self, data_config: Dict = None):
15
+ """
16
+ Initialize the data processor
17
+
18
+ Parameters:
19
+ -----------
20
+ data_config : Dict
21
+ Configuration for loading datasets with column dtypes
22
+ """
23
+ self.data_config = data_config or {}
24
+ self.datasets = {}
25
+ self.X_train = None
26
+ self.X_test = None
27
+ self.y_train = None
28
+ self.y_test = None
29
+ self.feature_cols = None
30
+ self.target_col = None
31
+ self.categorical_cols = []
32
+ self.date_cols = ['Date']
33
+ self.numerical_cols = []
34
+ self.scaler = None
35
+
36
+ @staticmethod
37
+ def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
38
+ """
39
+ Optimize memory usage by converting columns to appropriate dtypes
40
+
41
+ Parameters:
42
+ -----------
43
+ df : pd.DataFrame
44
+ The dataframe to optimize
45
+
46
+ Returns:
47
+ --------
48
+ pd.DataFrame
49
+ Memory-optimized dataframe
50
+ """
51
+ # Convert integer columns to optimal integer type
52
+ for col in df.select_dtypes(include=['int']).columns:
53
+ if df[col].min() >= 0:
54
+ if df[col].max() <= 255:
55
+ df[col] = df[col].astype(np.uint8)
56
+ elif df[col].max() <= 65535:
57
+ df[col] = df[col].astype(np.uint16)
58
+ elif df[col].max() <= 4294967295:
59
+ df[col] = df[col].astype(np.uint32)
60
+ else:
61
+ df[col] = df[col].astype(np.uint64)
62
+ else:
63
+ if df[col].min() >= -128 and df[col].max() <= 127:
64
+ df[col] = df[col].astype(np.int8)
65
+ elif df[col].min() >= -32768 and df[col].max() <= 32767:
66
+ df[col] = df[col].astype(np.int16)
67
+ elif df[col].min() >= -2147483648 and df[col].max() <= 2147483647:
68
+ df[col] = df[col].astype(np.int32)
69
+ else:
70
+ df[col] = df[col].astype(np.int64)
71
+
72
+ # Convert float columns to float32 (usually sufficient precision)
73
+ for col in df.select_dtypes(include=['float']).columns:
74
+ df[col] = df[col].astype(np.float32)
75
+
76
+ # Categorical columns can be converted to 'category' dtype
77
+ for col in df.select_dtypes(include=['object']).columns:
78
+ if col != 'Date' and df[col].nunique() / len(df) < 0.5: # If it's not a date column and has less than 50% unique values
79
+ df[col] = df[col].astype('category')
80
+
81
+ return df
82
+
83
+ def load_dataset(self, name: str, file_path: str, optimize_memory: bool = True) -> pd.DataFrame:
84
+ """
85
+ Load a dataset from file with memory optimization
86
+
87
+ Parameters:
88
+ -----------
89
+ name : str
90
+ Name to identify the dataset
91
+ file_path : str
92
+ Path to the dataset file
93
+ optimize_memory : bool
94
+ Whether to optimize memory usage
95
+
96
+ Returns:
97
+ --------
98
+ pd.DataFrame
99
+ The loaded dataset
100
+ """
101
+ print(f"Loading dataset: {name} from {file_path}")
102
+
103
+ # Get column dtypes if specified in config
104
+ dtype_dict = self.data_config.get(name, {}).get('dtypes', None)
105
+
106
+ # Load with chunk size for large files to avoid memory issues
107
+ if file_path.endswith('.csv'):
108
+ try:
109
+ if dtype_dict:
110
+ df = pd.read_csv(file_path, dtype=dtype_dict)
111
+ else:
112
+ # For large files, read in chunks and concatenate
113
+ chunks = []
114
+ for chunk in pd.read_csv(file_path, chunksize=100000):
115
+ if optimize_memory:
116
+ chunk = self.optimize_dtypes(chunk)
117
+ chunks.append(chunk)
118
+
119
+ df = pd.concat(chunks, axis=0)
120
+ del chunks
121
+ gc.collect()
122
+ except Exception as e:
123
+ print(f"Error loading CSV: {e}")
124
+ return None
125
+ else:
126
+ print(f"Unsupported file format: {file_path}")
127
+ return None
128
+
129
+ # Convert date columns
130
+ if 'Date' in df.columns:
131
+ df['Date'] = pd.to_datetime(df['Date'])
132
+
133
+ # Optimize memory usage
134
+ if optimize_memory and dtype_dict is None:
135
+ df = self.optimize_dtypes(df)
136
+
137
+ # Store dataset
138
+ self.datasets[name] = df
139
+
140
+ print(f"Dataset {name} loaded: {df.shape} - Memory usage: {df.memory_usage().sum() / 1024**2:.2f} MB")
141
+ return df
142
+
143
+ def merge_datasets(self, datasets: List[str], on: List[str], how: str = 'inner') -> pd.DataFrame:
144
+ """
145
+ Merge multiple datasets
146
+
147
+ Parameters:
148
+ -----------
149
+ datasets : List[str]
150
+ List of dataset names to merge
151
+ on : List[str]
152
+ Columns to merge on
153
+ how : str
154
+ Type of merge to perform
155
+
156
+ Returns:
157
+ --------
158
+ pd.DataFrame
159
+ Merged dataset
160
+ """
161
+ if not datasets or len(datasets) < 2:
162
+ print("Need at least two datasets to merge")
163
+ return None
164
+
165
+ # Start with the first dataset
166
+ result = self.datasets[datasets[0]].copy()
167
+
168
+ # Merge with the rest
169
+ for i in range(1, len(datasets)):
170
+ result = result.merge(self.datasets[datasets[i]], on=on, how=how)
171
+ gc.collect() # Force garbage collection to free memory
172
+
173
+ print(f"Merged dataset shape: {result.shape} - Memory usage: {result.memory_usage().sum() / 1024**2:.2f} MB")
174
+ return result
175
+
176
+ def load_data(file_path='US_engineered_features.csv', optimize_memory=True):
177
+ """
178
+ Load and preprocess the COVID-19 data
179
+ """
180
+ # Load the data
181
+ print(f"Loading data from {file_path}...")
182
+
183
+ # For large files, read in chunks
184
+ if optimize_memory:
185
+ chunks = []
186
+ for chunk in pd.read_csv(file_path, chunksize=100000):
187
+ # Optimize dtypes for each chunk
188
+ # Convert integer columns to optimal integer type
189
+ for col in chunk.select_dtypes(include=['int']).columns:
190
+ if chunk[col].min() >= 0:
191
+ if chunk[col].max() <= 255:
192
+ chunk[col] = chunk[col].astype(np.uint8)
193
+ elif chunk[col].max() <= 65535:
194
+ chunk[col] = chunk[col].astype(np.uint16)
195
+ else:
196
+ chunk[col] = chunk[col].astype(np.uint32)
197
+ else:
198
+ if chunk[col].min() >= -128 and chunk[col].max() <= 127:
199
+ chunk[col] = chunk[col].astype(np.int8)
200
+ elif chunk[col].min() >= -32768 and chunk[col].max() <= 32767:
201
+ chunk[col] = chunk[col].astype(np.int16)
202
+ else:
203
+ chunk[col] = chunk[col].astype(np.int32)
204
+
205
+ # Convert float columns to float32
206
+ for col in chunk.select_dtypes(include=['float']).columns:
207
+ chunk[col] = chunk[col].astype(np.float32)
208
+
209
+ chunks.append(chunk)
210
+
211
+ df = pd.concat(chunks, axis=0)
212
+ del chunks
213
+ gc.collect() # Force garbage collection
214
+ else:
215
+ df = pd.read_csv(file_path)
216
+
217
+ # Convert Date to datetime
218
+ if 'Date' in df.columns:
219
+ df['Date'] = pd.to_datetime(df['Date'])
220
+
221
+ # Display basic information
222
+ print(f"Data shape: {df.shape}")
223
+ if 'Date' in df.columns:
224
+ print(f"Time period: {df['Date'].min()} to {df['Date'].max()}")
225
+
226
+ return df
227
+
228
+ def preprocess_data(df, target_column='New_Confirmed', prediction_days=7, test_size=0.2):
229
+ """
230
+ Preprocess the data for regression modeling
231
+
232
+ Parameters:
233
+ - df: DataFrame containing the COVID-19 data
234
+ - target_column: The column to predict
235
+ - prediction_days: Number of days ahead to predict
236
+ - test_size: Proportion of data to use for testing
237
+
238
+ Returns:
239
+ - X_train, X_test, y_train, y_test: Train and test sets
240
+ - feature_names: Names of the features used for prediction
241
+ - scaler: The fitted scaler for inverse transformations
242
+ """
243
+ # Convert Date to datetime if not already
244
+ if 'Date' in df.columns and not pd.api.types.is_datetime64_any_dtype(df['Date']):
245
+ df['Date'] = pd.to_datetime(df['Date'])
246
+
247
+ # Create a shifted target column for prediction
248
+ df[f'{target_column}_future_{prediction_days}d'] = df[target_column].shift(-prediction_days)
249
+
250
+ # Drop rows with NaN values (typically the last n rows where future data is not available)
251
+ df = df.dropna(subset=[f'{target_column}_future_{prediction_days}d'])
252
+
253
+ # Remove non-numeric columns and columns that would cause data leakage
254
+ non_feature_cols = ['Date', 'Country/Region', f'{target_column}_future_{prediction_days}d']
255
+ leakage_cols = [col for col in df.columns if 'future' in col and col != f'{target_column}_future_{prediction_days}d']
256
+
257
+ # For regression models, we'll use all available numeric features
258
+ features = df.columns.difference(non_feature_cols + leakage_cols)
259
+
260
+ # Select features and target
261
+ X = df[features].copy()  # copy so the fillna calls below do not trigger chained-assignment warnings
262
+ y = df[f'{target_column}_future_{prediction_days}d']
263
+
264
+ # Fill missing values with median for numerical features or mode for categorical
265
+ for col in X.columns:
266
+ if X[col].isna().sum() > 0:
267
+ if np.issubdtype(X[col].dtype, np.number):
268
+ X[col] = X[col].fillna(X[col].median())
269
+ else:
270
+ X[col] = X[col].fillna(X[col].mode()[0])
271
+
272
+ # Split data into training and testing sets
273
+ X_train, X_test, y_train, y_test = train_test_split(
274
+ X, y, test_size=test_size, shuffle=False
275
+ )
276
+
277
+ # Convert pandas DataFrames to NumPy arrays to save memory
278
+ X_train_np = X_train.values
279
+ X_test_np = X_test.values
280
+
281
+ # Scale the features
282
+ scaler = StandardScaler()
283
+ X_train_scaled = scaler.fit_transform(X_train_np)
284
+ X_test_scaled = scaler.transform(X_test_np)
285
+
286
+ # Release memory
287
+ del X_train_np, X_test_np
288
+ gc.collect()
289
+
290
+ print(f"Training set shape: {X_train.shape}")
291
+ print(f"Testing set shape: {X_test.shape}")
292
+ print(f"Features used: {len(features)}")
293
+
294
+ return X_train_scaled, X_test_scaled, y_train, y_test, list(features), scaler
295
+
296
+ def main():
297
+ """
298
+ Main function to demonstrate data preprocessing
299
+ """
300
+ # Define data configuration for memory optimization
301
+ data_config = {
302
+ 'covid_full': {
303
+ 'dtypes': {
304
+ 'Country/Region': 'category',
305
+ 'Confirmed': 'int32',
306
+ 'Deaths': 'int32',
307
+ 'Recovered': 'int32',
308
+ 'New_Confirmed': 'int32',
309
+ 'New_Deaths': 'int32',
310
+ 'New_Recovered': 'int32',
311
+ }
312
+ },
313
+ 'us_engineered': {
314
+ 'dtypes': {
315
+ 'Country/Region': 'category',
316
+ 'Confirmed': 'int32',
317
+ 'Deaths': 'int32',
318
+ 'New_Confirmed': 'int32',
319
+ 'New_Deaths': 'int32',
320
+ 'Day': 'int8',
321
+ 'Day_of_week': 'int8',
322
+ 'Month': 'int8',
323
+ 'Year': 'int16',
324
+ }
325
+ }
326
+ }
327
+
328
+ # Choose the approach:
329
+ # 1. Basic approach (backward compatible)
330
+ # 2. Advanced multi-dataset approach
331
+ approach = 2 # Change to 1 for basic approach
332
+
333
+ if approach == 1:
334
+ # Basic approach (legacy code)
335
+ print("Using basic approach...")
336
+ df = load_data(optimize_memory=True)
337
+ X_train, X_test, y_train, y_test, features, scaler = preprocess_data(df)
338
+ else:
339
+ # Advanced multi-dataset approach
340
+ print("Using advanced multi-dataset approach...")
341
+ processor = COVIDDataProcessor(data_config)
342
+
343
+ # Load datasets
344
+ processor.load_dataset('covid_full', 'covid_full_dataset.csv')
345
+ processor.load_dataset('us_engineered', 'US_engineered_features.csv')
346
+
347
+ # Example: You can also load other datasets and merge them
348
+ # processor.load_dataset('raw_confirmed', 'raw_confirmed.csv')
349
+ # processor.load_dataset('raw_deaths', 'raw_deaths.csv')
350
+ # processor.merge_datasets(['raw_confirmed', 'raw_deaths'], on=['Date', 'Country/Region'])
351
+
352
+ # For simplicity, we'll just use the US engineered features dataset
353
+ # But you could use any merged or single dataset here
354
+ X_train, X_test, y_train, y_test, features, scaler = preprocess_data(
355
+ processor.datasets['us_engineered']
356
+ )
357
+
358
+ print("\nPreprocessing complete!")
359
+ print(f"Number of training samples: {len(X_train)}")
360
+ print(f"Number of testing samples: {len(X_test)}")
361
+ print(f"Target range: {y_train.min()} to {y_train.max()}")
362
+
363
+ # Save the preprocessed data - using numpy's compressed format to save space
364
+ np.savez_compressed('X_train.npz', data=X_train)
365
+ np.savez_compressed('X_test.npz', data=X_test)
366
+ np.savez_compressed('y_train.npz', data=y_train)
367
+ np.savez_compressed('y_test.npz', data=y_test)
368
+
369
+ # Also save as .npy for backward compatibility
370
+ np.save('X_train.npy', X_train)
371
+ np.save('X_test.npy', X_test)
372
+ np.save('y_train.npy', y_train if isinstance(y_train, np.ndarray) else y_train.values)
373
+ np.save('y_test.npy', y_test if isinstance(y_test, np.ndarray) else y_test.values)
374
+
375
+ # Save features list
376
+ with open('features.txt', 'w') as f:
377
+ for feature in features:
378
+ f.write(f"{feature}\n")
379
+
380
+ # Save scaler
381
+ import joblib
382
+ joblib.dump(scaler, 'scaler.pkl')
383
+
384
+ print("Preprocessed data saved!")
385
+
386
+ # Print memory usage stats
387
+ import psutil
388
+ process = psutil.Process(os.getpid())
389
+ print(f"Current memory usage: {process.memory_info().rss / (1024 * 1024):.2f} MB")
390
+
391
+ if __name__ == "__main__":
392
+ main()
raw_confirmed.csv ADDED
The diff for this file is too large to render. See raw diff
 
raw_deaths.csv ADDED
The diff for this file is too large to render. See raw diff
 
raw_owid.csv ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b2c25df2d17be38533d88d9b714d80050f615519ab39f1ef5831ae38db7dff46
3
+ size 107886083
raw_recovered.csv ADDED
The diff for this file is too large to render. See raw diff
 
requirements.txt ADDED
@@ -0,0 +1,10 @@
1
+ pandas==2.1.0
2
+ numpy==1.26.0
3
+ matplotlib==3.8.0
4
+ seaborn==0.13.0
5
+ scikit-learn==1.3.0
6
+ xgboost==2.0.0
7
+ lightgbm==4.1.0
8
+ shap==0.43.0
9
+ gradio==4.10.0
10
+ joblib==1.3.2
run_pipeline.py ADDED
@@ -0,0 +1,93 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ COVID-19 Prediction Pipeline
4
+ ---------------------------
5
+ This script runs the complete pipeline from data preprocessing to model training
6
+ and launches the Gradio UI.
7
+ """
8
+
9
+ import os
10
+ import argparse
11
+ import subprocess
12
+ import time
13
+
14
+ def clear_screen():
15
+ """Clear the terminal screen"""
16
+ os.system('cls' if os.name == 'nt' else 'clear')
17
+
18
+ def run_command(command, description):
19
+ """Run a system command and print output"""
20
+ print(f"\n{'=' * 80}")
21
+ print(f"STEP: {description}")
22
+ print(f"{'=' * 80}\n")
23
+ print(f"Running: {command}\n")
24
+
25
+ # Run the command
26
+ start_time = time.time()
27
+ result = subprocess.run(command, shell=True)
28
+ end_time = time.time()
29
+
30
+ # Check if command was successful
31
+ if result.returncode == 0:
32
+ print(f"\nSuccess! Completed in {end_time - start_time:.2f} seconds")
33
+ else:
34
+ print(f"\nError! Command failed with exit code {result.returncode}")
35
+ exit(1)
36
+
37
+ print(f"\n{'=' * 80}\n")
38
+ time.sleep(1)
39
+
40
+ def parse_args():
41
+ """Parse command-line arguments"""
42
+ parser = argparse.ArgumentParser(description="COVID-19 Prediction Pipeline")
43
+ parser.add_argument("--skip-preprocessing", action="store_true", help="Skip data preprocessing")
44
+ parser.add_argument("--skip-training", action="store_true", help="Skip model training")
45
+ parser.add_argument("--only-ui", action="store_true", help="Only launch the Gradio UI")
46
+ return parser.parse_args()
47
+
48
+ def main():
49
+ """Run the complete pipeline"""
50
+ args = parse_args()
51
+
52
+ # Display welcome banner
53
+ clear_screen()
54
+ print("\n" + "=" * 80)
55
+ print("COVID-19 PREDICTION PIPELINE".center(80))
56
+ print("=" * 80 + "\n")
57
+ print("This script will run the complete pipeline:")
58
+ print("1. Data preprocessing")
59
+ print("2. Model training")
60
+ print("3. Launch Gradio UI for predictions")
61
+ print("\nPress Ctrl+C at any time to stop the pipeline.")
62
+ print()
63
+
64
+ try:
65
+ # Step 1: Data Preprocessing
66
+ if args.only_ui:
67
+ print("Skipping preprocessing and training, launching UI only...")
68
+ else:
69
+ if not args.skip_preprocessing:
70
+ run_command("python preprocess_data.py", "Data Preprocessing")
71
+ else:
72
+ print("Skipping preprocessing as requested.")
73
+
74
+ # Step 2: Model Training
75
+ if not args.skip_training:
76
+ run_command("python train_models.py", "Model Training")
77
+ else:
78
+ print("Skipping model training as requested.")
79
+
80
+ # Step 3: Launch Gradio UI
81
+ print("\nLaunching Gradio UI for predictions...")
82
+ run_command("python gradio_app.py", "Gradio UI Launch")
83
+
84
+ except KeyboardInterrupt:
85
+ print("\n\nPipeline interrupted by user. Exiting.")
86
+ exit(0)
87
+
88
+ except Exception as e:
89
+ print(f"\n\nError in pipeline: {str(e)}")
90
+ exit(1)
91
+
92
+ if __name__ == "__main__":
93
+ main()
train_models.py ADDED
@@ -0,0 +1,165 @@
1
+ import numpy as np
2
+ import pandas as pd
3
+ from sklearn.linear_model import LinearRegression
4
+ from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
5
+ from sklearn.svm import SVR
6
+ from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
7
+ import matplotlib.pyplot as plt
8
+ import seaborn as sns
9
+ import joblib
10
+ import os
11
+ import gc
12
+ import psutil
13
+ from typing import Dict, List, Tuple
14
+
15
+ # Set the style for plots
16
+ sns.set(style="whitegrid")
17
+
18
+ # Set up memory monitoring
19
+ def print_memory_usage():
20
+ process = psutil.Process(os.getpid())
21
+ memory_usage = process.memory_info().rss / (1024 * 1024) # Convert to MB
22
+ print(f"Current memory usage: {memory_usage:.2f} MB")
23
+
24
+ def train_and_evaluate_models(X_train, X_test, y_train, y_test, feature_names=None):
25
+ """
26
+ Train and evaluate multiple regression models for COVID-19 prediction
27
+
28
+ Parameters:
29
+ - X_train, X_test, y_train, y_test: Training and testing data
30
+ - feature_names: List of feature names (for feature importance)
31
+
32
+ Returns:
33
+ - models: Dictionary of trained models
34
+ - metrics: Dictionary of evaluation metrics for each model
35
+ """
36
+ models = {
37
+ 'Linear Regression': LinearRegression(),
38
+ 'Support Vector Regression': SVR(kernel='rbf', gamma='scale', C=1.0, epsilon=0.1),
39
+ 'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
40
+ 'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
41
+ }
42
+
43
+ metrics = {
44
+ 'Model': [],
45
+ 'RMSE': [],
46
+ 'MAE': [],
47
+ 'R²': []
48
+ }
49
+
50
+ for name, model in models.items():
51
+ print(f"Training {name}...")
52
+ model.fit(X_train, y_train)
53
+
54
+ # Predict
55
+ y_pred = model.predict(X_test)
56
+
57
+ # Calculate metrics
58
+ rmse = np.sqrt(mean_squared_error(y_test, y_pred))
59
+ mae = mean_absolute_error(y_test, y_pred)
60
+ r2 = r2_score(y_test, y_pred)
61
+
62
+ # Store metrics
63
+ metrics['Model'].append(name)
64
+ metrics['RMSE'].append(rmse)
65
+ metrics['MAE'].append(mae)
66
+ metrics['R²'].append(r2)
67
+
68
+ print(f"{name} - RMSE: {rmse:.2f}, MAE: {mae:.2f}, R²: {r2:.4f}")
69
+
70
+ # Save the model
71
+ joblib.dump(model, f'{name.replace(" ", "_").lower()}_model.pkl')
72
+
73
+ # Plot actual vs predicted
74
+ plt.figure(figsize=(10, 6))
75
+ plt.scatter(y_test, y_pred, alpha=0.5)
76
+ plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
77
+ plt.title(f'{name} - Actual vs Predicted')
78
+ plt.xlabel('Actual')
79
+ plt.ylabel('Predicted')
80
+ plt.savefig(f'{name.replace(" ", "_").lower()}_predictions.png')
+ plt.close()  # free the figure to keep memory usage low
81
+
82
+ # If it's Random Forest or Gradient Boosting, plot feature importance
83
+ if name in ['Random Forest', 'Gradient Boosting'] and feature_names is not None:
84
+ plt.figure(figsize=(12, 8))
85
+ feature_importance = model.feature_importances_
86
+ sorted_idx = np.argsort(feature_importance)
87
+
88
+ # Select top 15 features for better visualization
89
+ top_k = min(15, len(feature_importance))
90
+ plt.barh(range(top_k), feature_importance[sorted_idx][-top_k:])
91
+ plt.yticks(range(top_k), [feature_names[i] for i in sorted_idx[-top_k:]])
92
+ plt.title(f'{name} - Top {top_k} Feature Importance')
93
+ plt.tight_layout()
94
+ plt.savefig(f'{name.replace(" ", "_").lower()}_feature_importance.png')
+ plt.close()  # free the figure to keep memory usage low
95
+
96
+ # Plot comparison of models
97
+ metrics_df = pd.DataFrame(metrics)
98
+
99
+ # Create bar plot for RMSE and MAE
100
+ plt.figure(figsize=(12, 6))
101
+
102
+ bar_width = 0.35
103
+ index = np.arange(len(metrics_df['Model']))
104
+
105
+ plt.bar(index, metrics_df['RMSE'], bar_width, label='RMSE')
106
+ plt.bar(index + bar_width, metrics_df['MAE'], bar_width, label='MAE')
107
+
108
+ plt.xlabel('Model')
109
+ plt.ylabel('Error')
110
+ plt.title('Model Comparison - RMSE and MAE')
111
+ plt.xticks(index + bar_width / 2, metrics_df['Model'], rotation=45)
112
+ plt.legend()
113
+ plt.tight_layout()
114
+ plt.savefig('model_comparison_error.png')
115
+
116
+ # Create bar plot for R²
117
+ plt.figure(figsize=(12, 6))
118
+ plt.bar(metrics_df['Model'], metrics_df['R²'], color='skyblue')
119
+ plt.xlabel('Model')
120
+ plt.ylabel('R²')
121
+ plt.title('Model Comparison - R²')
122
+ plt.xticks(rotation=45)
123
+ plt.tight_layout()
124
+ plt.savefig('model_comparison_r2.png')
125
+
126
+ print("\nModel training and evaluation complete!")
127
+ print(f"Models saved as: {', '.join([f'{name.replace(' ', '_').lower()}_model.pkl' for name in models.keys()])}")
128
+
129
+ return models, metrics_df
130
+
131
+ def main():
132
+ """
133
+ Main function to train and evaluate models
134
+ """
135
+ # Check if preprocessed data exists
136
+ if not all(os.path.exists(f) for f in ['X_train.npy', 'X_test.npy', 'y_train.npy', 'y_test.npy']):
137
+ print("Preprocessed data not found. Please run preprocess_data.py first.")
138
+ return
139
+
140
+ # Load preprocessed data
141
+ X_train = np.load('X_train.npy')
142
+ X_test = np.load('X_test.npy')
143
+ y_train = np.load('y_train.npy')
144
+ y_test = np.load('y_test.npy')
145
+
146
+ # Load feature names
147
+ feature_names = []
148
+ if os.path.exists('features.txt'):
149
+ with open('features.txt', 'r') as f:
150
+ feature_names = [line.strip() for line in f.readlines()]
151
+
152
+ print("Data loaded successfully!")
153
+ print(f"Training data shape: {X_train.shape}")
154
+ print(f"Testing data shape: {X_test.shape}")
155
+
156
+ # Train and evaluate models
157
+ models, metrics = train_and_evaluate_models(X_train, X_test, y_train, y_test, feature_names)
158
+
159
+ # Display and save comparison table
160
+ print("\nModel Comparison:")
161
+ print(metrics)
162
+ metrics.to_csv('model_comparison.csv', index=False)
163
+
164
+ if __name__ == "__main__":
165
+ main()