vishwak1 committed on
Commit
fb61aba
·
verified ·
1 Parent(s): f0dd4ab

Upload 14 files

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ covid_country_ts.csv filter=lfs diff=lfs merge=lfs -text
37
+ covid_full_dataset.csv filter=lfs diff=lfs merge=lfs -text
38
+ raw_owid.csv filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,13 +1,119 @@
1
- ---
2
- title: Disease Prediction
3
- emoji: 🌖
4
- colorFrom: blue
5
- colorTo: red
6
- sdk: gradio
7
- sdk_version: 5.31.0
8
- app_file: app.py
9
- pinned: false
10
- short_description: project
11
- ---
12
-
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # COVID-19 Prediction Model
2
+
3
+ This project implements a COVID-19 case prediction system built on four regression models, with Random Forest as the primary model. It includes a Gradio user interface for deployment to Hugging Face Spaces.
4
+
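+ At its core, the pipeline frames forecasting as supervised regression: each row's engineered features are paired with the case count `prediction_days` ahead, which becomes the regression target. A condensed sketch of how `preprocess_data.py` builds that target (column names follow the project's datasets):
+
+ ```
+ import pandas as pd
+
+ df = pd.read_csv('US_engineered_features.csv', parse_dates=['Date'])
+
+ # Pair today's features with the case count 7 days ahead (the value the models learn to predict)
+ prediction_days = 7
+ df[f'New_Confirmed_future_{prediction_days}d'] = df['New_Confirmed'].shift(-prediction_days)
+ df = df.dropna(subset=[f'New_Confirmed_future_{prediction_days}d'])
+ ```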
5
+ ## Features
6
+
7
+ - **Memory-optimized** data processing that handles multiple large datasets by downcasting numeric columns and storing low-cardinality text columns as categoricals
8
+ - **Multiple regression models** for comparison:
9
+ - Random Forest Regression
10
+ - Linear Regression
11
+ - Support Vector Regression (SVR)
12
+ - Gradient Boosting Regression
13
+ - **Gradio UI** for easy model selection, visualization, and deployment to Hugging Face Spaces
14
+ - Complete **data preprocessing pipeline** with feature engineering
15
+ - **Performance evaluation** metrics and visualization
16
+
17
+ ## Project Structure
18
+
19
+ ```
20
+ COVID-19-Prediction/
21
+ ├── covid_full_dataset.csv # Complete COVID-19 dataset
22
+ ├── US_engineered_features.csv # Engineered features for US data
23
+ ├── raw_confirmed.csv # Raw confirmed cases data
24
+ ├── raw_deaths.csv # Raw deaths data
25
+ ├── raw_recovered.csv # Raw recovered cases data
26
+ ├── raw_owid.csv # Additional data from Our World in Data
27
+ ├── covid_country_ts.csv # Country-level time series data
28
+ ├── preprocess_data.py # Data preprocessing script
29
+ ├── train_models.py # Model training script
30
+ ├── gradio_app.py # Gradio UI for predictions
31
+ ├── run_pipeline.py # Complete pipeline runner
32
+ └── requirements.txt # Project dependencies
33
+ ```
34
+
35
+ ## Installation
36
+
37
+ 1. Clone this repository:
38
+ ```
39
+ git clone https://github.com/yourusername/covid19-prediction.git
40
+ cd covid19-prediction
41
+ ```
42
+
43
+ 2. Install the required packages:
44
+ ```
45
+ pip install -r requirements.txt
46
+ ```
47
+
48
+ ## Usage
49
+
50
+ ### Run the Complete Pipeline
51
+
52
+ To run the complete pipeline (preprocessing, training, and UI):
53
+
54
+ ```
55
+ python run_pipeline.py
56
+ ```
57
+
58
+ ### Pipeline Options
59
+
60
+ - Skip preprocessing: `python run_pipeline.py --skip-preprocessing`
61
+ - Skip training: `python run_pipeline.py --skip-training`
62
+ - Only launch UI: `python run_pipeline.py --only-ui`
63
+
64
+ ### Run Individual Steps
65
+
66
+ 1. **Data Preprocessing**:
67
+ ```
68
+ python preprocess_data.py
69
+ ```
70
+
71
+ 2. **Model Training**:
72
+ ```
73
+ python train_models.py
74
+ ```
75
+
76
+ 3. **Launch Gradio UI**:
77
+ ```
78
+ python gradio_app.py
79
+ ```
80
+
81
+ ## Memory Optimization
82
+
83
+ This project is optimized to handle large datasets efficiently (a short sketch follows this list):
84
+
85
+ - Uses appropriate data types to minimize memory footprint
86
+ - Processes data in chunks for large files
87
+ - Employs garbage collection to free memory
88
+ - Uses compressed NumPy formats for storing processed data
89
+ - Optimizes model parameters for memory efficiency
90
+
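+ A minimal sketch of the dtype downcasting and chunked loading used in `preprocess_data.py` (the chunk size and 50% cardinality threshold mirror the script; the file name is one of the project datasets):
+
+ ```
+ import numpy as np
+ import pandas as pd
+
+ def downcast(df: pd.DataFrame) -> pd.DataFrame:
+     # Shrink numeric columns to the smallest safe dtype
+     for col in df.select_dtypes(include=['int']).columns:
+         kind = 'unsigned' if df[col].min() >= 0 else 'integer'
+         df[col] = pd.to_numeric(df[col], downcast=kind)
+     for col in df.select_dtypes(include=['float']).columns:
+         df[col] = df[col].astype(np.float32)
+     # Low-cardinality text columns become pandas categoricals
+     for col in df.select_dtypes(include=['object']).columns:
+         if col != 'Date' and df[col].nunique() / len(df) < 0.5:
+             df[col] = df[col].astype('category')
+     return df
+
+ # Read a large CSV in chunks, downcasting each chunk before concatenating
+ chunks = [downcast(chunk) for chunk in pd.read_csv('covid_full_dataset.csv', chunksize=100_000)]
+ df = pd.concat(chunks, axis=0)
+ ```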
91
+ ## Models
92
+
93
+ The project implements and compares four regression models (their hyperparameters are shown after this list):
94
+
95
+ 1. **Random Forest Regressor**: An ensemble learning method that builds multiple decision trees and merges their predictions.
96
+ 2. **Linear Regression**: A simple baseline model that assumes a linear relationship between features and target.
97
+ 3. **Support Vector Regression (SVR)**: Uses support vectors to create a regression model that can capture non-linear relationships.
98
+ 4. **Gradient Boosting Regressor**: An ensemble technique that builds trees sequentially, with each tree correcting errors made by previous ones.
99
+
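+ The model configurations used in `train_models.py`:
+
+ ```
+ from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
+ from sklearn.linear_model import LinearRegression
+ from sklearn.svm import SVR
+
+ models = {
+     'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
+     'Linear Regression': LinearRegression(),
+     'Support Vector Regression': SVR(kernel='rbf', gamma='scale', C=1.0, epsilon=0.1),
+     'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
+                                                    max_depth=3, random_state=42),
+ }
+ ```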
100
+ ## Hugging Face Deployment
101
+
102
+ The Gradio interface is configured for easy deployment to Hugging Face Spaces:
103
+
104
+ 1. Create a new Space on Hugging Face
105
+ 2. Upload all files to the Space
106
+ 3. The app detects the Hugging Face environment automatically (via the `SPACE_ID` environment variable) and configures itself for the Space, as sketched below
107
+
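+ A condensed sketch of that environment check in `gradio_app.py` (the `Blocks` body here stands in for the full interface built by `create_interface()`):
+
+ ```
+ import os
+ import gradio as gr
+
+ with gr.Blocks() as demo:  # stand-in for create_interface()
+     gr.Markdown("COVID-19 Case Prediction")
+
+ # Hugging Face sets SPACE_ID inside every Space, so its presence doubles as an environment check
+ if os.environ.get('SPACE_ID') is not None:
+     demo.launch(server_name="0.0.0.0", share=False)  # Space deployment
+ else:
+     demo.launch(share=True, debug=True)  # local run with a shareable link
+ ```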
108
+ ## Contributing
109
+
110
+ Contributions are welcome! Please feel free to submit a Pull Request.
111
+
112
+ ## License
113
+
114
+ This project is licensed under the MIT License - see the LICENSE file for details.
115
+
116
+ ## Acknowledgments
117
+
118
+ - Data sources: Johns Hopkins CSSE, Our World in Data
119
+ - Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, Gradio
US_engineered_features.csv ADDED
The diff for this file is too large to render. See raw diff
 
covid_country_ts.csv ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1af870f40973ef368b35e81c247ffde415e73db91bf80764412a17d86847474d
3
+ size 19035987
covid_full_dataset.csv ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:382205dd481f9868e30e5d3574714061e9e9c492015061934d5bc92176cbed04
3
+ size 44431554
gradio_app.py ADDED
@@ -0,0 +1,363 @@
1
+ import gradio as gr
2
+ import pandas as pd
3
+ import numpy as np
4
+ import joblib
5
+ from datetime import datetime, timedelta
6
+ import os
7
+ import matplotlib.pyplot as plt
8
+ import gc
9
+ from typing import Dict, List, Tuple, Union, Any
10
+
11
+ # Load the models and scaler
12
+ def load_models():
13
+ models = {}
14
+ model_files = [f for f in os.listdir() if f.endswith('_model.pkl')]
15
+
16
+ if not model_files:
17
+ print("No trained models found. Please run train_models.py first.")
18
+ return None
19
+
20
+ for model_file in model_files:
21
+ model_name = ' '.join([word.capitalize() for word in model_file.replace('_model.pkl', '').split('_')])
22
+ models[model_name] = joblib.load(model_file)
23
+
24
+ print(f"Loaded {len(models)} models: {', '.join(models.keys())}")
25
+ return models
26
+
27
+ # Create a function to get most recent data for prediction
28
+ def get_recent_data(data_file='US_engineered_features.csv', rows=30, optimize_memory=True):
29
+ """
30
+ Load and process recent data for prediction
31
+
32
+ Parameters:
33
+ -----------
34
+ data_file : str
35
+ Path to the data file
36
+ rows : int
37
+ Number of rows to retrieve from the end of the dataset
38
+ optimize_memory : bool
39
+ Whether to optimize memory usage
40
+
41
+ Returns:
42
+ --------
43
+ pd.DataFrame
44
+ The most recent rows, sorted newest-first by date
45
+ """
46
+ # For memory optimization, define dtypes for critical columns
47
+ if optimize_memory:
48
+ dtype_dict = {
49
+ 'New_Confirmed': 'int32',
50
+ 'Deaths': 'int32',
51
+ 'Confirmed': 'int32',
52
+ 'Country/Region': 'category'
53
+ }
54
+
55
+ # Read only necessary columns if file is large
56
+ try:
57
+ # First check the file size
58
+ file_size = os.path.getsize(data_file) / (1024 * 1024) # Size in MB
59
+
60
+ if file_size > 100: # If file is larger than 100MB
61
+ # Get column list first
62
+ df_cols = pd.read_csv(data_file, nrows=1).columns.tolist()
63
+
64
+ # Define essential columns for prediction
65
+ essential_cols = ['Date', 'Country/Region', 'New_Confirmed', 'Deaths', 'Confirmed',
66
+ 'Recovered', 'New_Deaths', 'New_Recovered',
67
+ 'population', 'population_density', 'median_age']
68
+
69
+ # Filter to columns that exist in the dataset
70
+ cols_to_use = [col for col in essential_cols if col in df_cols]
71
+
72
+ # Read only the essential columns to keep memory usage low
73
+ df = pd.read_csv(data_file,
74
+ usecols=cols_to_use,
75
+ dtype={col: dtype_dict.get(col, None) for col in cols_to_use if col in dtype_dict})
76
+ else:
77
+ df = pd.read_csv(data_file, dtype=dtype_dict)
78
+ except Exception as e:
79
+ print(f"Error optimizing data load: {e}")
80
+ # Fall back to standard loading
81
+ df = pd.read_csv(data_file)
82
+ else:
83
+ df = pd.read_csv(data_file)
84
+
85
+ # Convert Date to datetime
86
+ df['Date'] = pd.to_datetime(df['Date'])
87
+
88
+ # Sort and get recent data
89
+ df = df.sort_values('Date', ascending=False).head(rows)
90
+
91
+ # Create a plot of recent confirmed cases
92
+ plt.figure(figsize=(10, 6))
93
+ plt.plot(df['Date'], df['New_Confirmed'], marker='o')
94
+ plt.title('Recent New Confirmed COVID-19 Cases')
95
+ plt.xlabel('Date')
96
+ plt.ylabel('New Confirmed Cases')
97
+ plt.xticks(rotation=45)
98
+ plt.tight_layout()
99
+ plt.savefig('recent_cases.png')
100
+ plt.close() # Close to free memory
101
+
102
+ return df
103
+
104
+ # Function to create predictions
105
+ def make_prediction(model, data, feature_names, days_to_predict=7, scaler=None):
106
+ """
107
+ Make prediction using the selected model and data
108
+
109
+ Parameters:
110
+ -----------
111
+ model : object
112
+ Trained model with predict method
113
+ data : pd.DataFrame
114
+ Data to make prediction on
115
+ feature_names : List[str]
116
+ Names of the features used for prediction
117
+ days_to_predict : int
118
+ Number of days ahead to predict
119
+
120
+ Returns:
121
+ --------
122
+ Tuple[float, str]
123
+ Prediction value and prediction date
124
+ """
125
+ # Get the most recent row
126
+ recent_data = data.iloc[0:1].copy()  # copy so the fill-ins below do not modify the caller's frame
127
+
128
+ # Handle missing features - fill with median/mode values if needed
129
+ missing_features = [f for f in feature_names if f not in recent_data.columns]
130
+ if missing_features:
131
+ print(f"Warning: {len(missing_features)} features are missing from the dataset and will be filled with defaults")
132
+ for feat in missing_features:
133
+ # Use a default value of 0 for missing features
134
+ recent_data[feat] = 0
135
+
136
+ # Handle NaN values
137
+ for feat in feature_names:
138
+ if feat in recent_data.columns and recent_data[feat].isna().any():
139
+ recent_data[feat] = recent_data[feat].fillna(0)
140
+
141
+ # Extract features - make sure to keep only the features the model was trained on
142
+ try:
143
+ features = recent_data[feature_names].values
144
+
145
+ # Convert to float32 for memory efficiency and compatibility
146
+ features = features.astype(np.float32)
+
+ # Apply the same scaling used during training, if a scaler was provided
+ if scaler is not None:
+ features = scaler.transform(features)
147
+
148
+ # Make prediction
149
+ prediction = model.predict(features)[0]
150
+
151
+ # Get the date for prediction
152
+ prediction_date = recent_data['Date'].iloc[0] + timedelta(days=days_to_predict)
153
+
154
+ return prediction, prediction_date.strftime('%Y-%m-%d')
155
+
156
+ except Exception as e:
157
+ print(f"Error making prediction: {e}")
158
+ # Return a reasonable fallback
159
+ return 0, (recent_data['Date'].iloc[0] + timedelta(days=days_to_predict)).strftime('%Y-%m-%d')
160
+
161
+ # Get available datasets
162
+ def get_available_datasets():
163
+ """Get list of available datasets in the current directory"""
164
+ datasets = [f for f in os.listdir() if f.endswith('.csv')]
165
+ return datasets
166
+
167
+ # Gradio interface function
168
+ def predict_covid_cases(model_name, dataset_name, prediction_days):
169
+ """
170
+ Make COVID-19 predictions using the selected model and dataset
171
+
172
+ Parameters:
173
+ -----------
174
+ model_name : str
175
+ Name of the model to use for prediction
176
+ dataset_name : str
177
+ Name of the dataset to use for prediction
178
+ prediction_days : int
179
+ Number of days ahead to predict
180
+
181
+ Returns:
182
+ --------
183
+ Tuple[str, str]
184
+ Prediction results and path to the plot image
185
+ """
186
+ # Load all necessary models and data
187
+ models = load_models()
188
+ if not models:
189
+ return "No trained models available. Please train the models first.", None
190
+
191
+ # Load scaler if available
192
+ scaler = None
193
+ if os.path.exists('scaler.pkl'):
194
+ try:
195
+ scaler = joblib.load('scaler.pkl')
196
+ except Exception as e:
197
+ print(f"Warning: Could not load scaler: {e}")
198
+
199
+ # Get recent data
200
+ try:
201
+ recent_data = get_recent_data(data_file=dataset_name)
202
+ except Exception as e:
203
+ return f"Error loading data from {dataset_name}: {str(e)}", None
204
+
205
+ # Load feature names
206
+ if not os.path.exists('features.txt'):
207
+ return "Features list not found. Please run preprocessing first.", None
208
+
209
+ with open('features.txt', 'r') as f:
210
+ feature_names = [line.strip() for line in f.readlines()]
211
+
212
+ # Make prediction using the selected model
213
+ try:
214
+ prediction, prediction_date = make_prediction(
215
+ models[model_name],
216
+ recent_data,
217
+ feature_names,
218
+ days_to_predict=int(prediction_days),
+ scaler=scaler  # reuse the training-time feature scaling
219
+ )
220
+
221
+ # Create output message
222
+ result = f"## COVID-19 Prediction Results\n\n"
223
+ result += f"### Model: {model_name}\n\n"
224
+ result += f"### Dataset: {dataset_name}\n\n"
225
+ result += f"### Prediction for {prediction_date}:\n"
226
+ result += f"**New confirmed cases: {int(prediction):,}**\n\n"
227
+
228
+ # Current cases for comparison
229
+ latest_date = recent_data['Date'].iloc[0].strftime('%Y-%m-%d')
230
+ latest_cases = recent_data['New_Confirmed'].iloc[0]
231
+ result += f"### Latest data ({latest_date}):\n"
232
+ result += f"**New confirmed cases: {int(latest_cases):,}**\n\n"
233
+
234
+ # Calculate percent change
235
+ # Guard against division by zero when the latest reported count is 0
+ percent_change = ((prediction - latest_cases) / latest_cases) * 100 if latest_cases else 0.0
236
+ change_direction = "increase" if percent_change > 0 else "decrease"
237
+ result += f"### This represents a {abs(percent_change):.2f}% {change_direction} from the latest data.\n"
238
+
239
+ # Force garbage collection
240
+ gc.collect()
241
+
242
+ # Add the recent cases plot
243
+ if os.path.exists('recent_cases.png'):
244
+ return result, 'recent_cases.png'
245
+ else:
246
+ return result, None
247
+
248
+ except Exception as e:
249
+ return f"Error making prediction: {str(e)}", None
250
+
251
+ # Create and launch the Gradio interface
252
+ def create_interface():
253
+ """
254
+ Create the Gradio interface for COVID-19 prediction
255
+
256
+ Returns:
257
+ --------
258
+ gr.Blocks
259
+ Gradio interface
260
+ """
261
+ # Load models to get available model names
262
+ models = load_models()
263
+ if not models:
264
+ model_names = ["No models available"]
265
+ else:
266
+ model_names = list(models.keys())
267
+
268
+ # Get available datasets
269
+ datasets = get_available_datasets()
270
+ if not datasets:
271
+ dataset_names = ["No datasets available"]
272
+ else:
273
+ dataset_names = datasets
274
+
275
+ # Create the interface
276
+ with gr.Blocks(title="COVID-19 Prediction Model") as demo:
277
+ gr.Markdown(
278
+ """
279
+ # COVID-19 Case Prediction
280
+
281
+ This application uses regression models to predict future COVID-19 cases.
282
+ Select a model, dataset, and the number of days ahead to predict.
283
+ """
284
+ )
285
+
286
+ with gr.Row():
287
+ with gr.Column():
288
+ model_dropdown = gr.Dropdown(
289
+ choices=model_names,
290
+ label="Select Model",
291
+ value=model_names[0] if model_names else None
292
+ )
293
+
294
+ dataset_dropdown = gr.Dropdown(
295
+ choices=dataset_names,
296
+ label="Select Dataset",
297
+ value="US_engineered_features.csv" if "US_engineered_features.csv" in dataset_names else (dataset_names[0] if dataset_names else None)
298
+ )
299
+
300
+ prediction_days = gr.Slider(
301
+ minimum=1,
302
+ maximum=14,
303
+ value=7,
304
+ step=1,
305
+ label="Days to Predict Ahead"
306
+ )
307
+
308
+ predict_button = gr.Button("Predict")
309
+
310
+ with gr.Column():
311
+ output_text = gr.Markdown("Select a model, dataset, and prediction timeframe, then click 'Predict'")
312
+ output_image = gr.Image(label="Recent Case Trends")
313
+
314
+ predict_button.click(
315
+ fn=predict_covid_cases,
316
+ inputs=[model_dropdown, dataset_dropdown, prediction_days],
317
+ outputs=[output_text, output_image]
318
+ )
319
+
320
+ gr.Markdown(
321
+ """
322
+ ### About the Models
323
+
324
+ - **Random Forest**: A powerful ensemble model that works well with many features
325
+ - **Linear Regression**: A simple but effective baseline model
326
+ - **SVR (Support Vector Regression)**: Good for capturing non-linear relationships
327
+ - **Gradient Boosting**: An ensemble model that builds trees sequentially
328
+
329
+ ### Memory Usage
330
+
331
+ This application is optimized to handle multiple datasets of different types while minimizing memory usage.
332
+ """
333
+ )
334
+
335
+ return demo
336
+
337
+ # Main function to start the Gradio app
338
+ def main():
339
+ """Launch the Gradio app"""
340
+ demo = create_interface()
341
+
342
+ # Configure for both local and Hugging Face deployment
343
+ # For Hugging Face deployment, we need to make sure it's servable publicly
344
+ is_huggingface = os.environ.get('SPACE_ID') is not None
345
+
346
+ if is_huggingface:
347
+ print("Detected Hugging Face environment, configuring for Space deployment")
348
+ # For HF Spaces, specific configuration
349
+ demo.launch(
350
+ server_name="0.0.0.0", # Bind to all interfaces
351
+ share=False, # No need for sharing link in HF
352
+ favicon_path="https://huggingface.co/favicon.ico" # Use HF favicon
353
+ )
354
+ else:
355
+ # Local deployment
356
+ print("Configuring for local deployment")
357
+ demo.launch(
358
+ share=True, # Create a shareable link
359
+ debug=True # Show more error details
360
+ )
361
+
362
+ if __name__ == "__main__":
363
+ main()
hf_space.yml ADDED
@@ -0,0 +1,3 @@
1
+ sdk: gradio
2
+ sdk_version: 4.10.0
3
+ app_file: gradio_app.py
preprocess_data.py ADDED
@@ -0,0 +1,392 @@
1
+ import pandas as pd
2
+ import numpy as np
3
+ from sklearn.preprocessing import StandardScaler
4
+ from sklearn.model_selection import train_test_split
5
+ import os
6
+ import gc
7
+ from typing import Dict, List, Tuple, Union
8
+
9
+ class COVIDDataProcessor:
10
+ """
11
+ Class to handle preprocessing of COVID-19 data from multiple datasets
12
+ with memory optimization.
13
+ """
14
+ def __init__(self, data_config: Dict = None):
15
+ """
16
+ Initialize the data processor
17
+
18
+ Parameters:
19
+ -----------
20
+ data_config : Dict
21
+ Configuration for loading datasets with column dtypes
22
+ """
23
+ self.data_config = data_config or {}
24
+ self.datasets = {}
25
+ self.X_train = None
26
+ self.X_test = None
27
+ self.y_train = None
28
+ self.y_test = None
29
+ self.feature_cols = None
30
+ self.target_col = None
31
+ self.categorical_cols = []
32
+ self.date_cols = ['Date']
33
+ self.numerical_cols = []
34
+ self.scaler = None
35
+
36
+ @staticmethod
37
+ def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
38
+ """
39
+ Optimize memory usage by converting columns to appropriate dtypes
40
+
41
+ Parameters:
42
+ -----------
43
+ df : pd.DataFrame
44
+ The dataframe to optimize
45
+
46
+ Returns:
47
+ --------
48
+ pd.DataFrame
49
+ Memory-optimized dataframe
50
+ """
51
+ # Convert integer columns to optimal integer type
52
+ for col in df.select_dtypes(include=['int']).columns:
53
+ if df[col].min() >= 0:
54
+ if df[col].max() <= 255:
55
+ df[col] = df[col].astype(np.uint8)
56
+ elif df[col].max() <= 65535:
57
+ df[col] = df[col].astype(np.uint16)
58
+ elif df[col].max() <= 4294967295:
59
+ df[col] = df[col].astype(np.uint32)
60
+ else:
61
+ df[col] = df[col].astype(np.uint64)
62
+ else:
63
+ if df[col].min() >= -128 and df[col].max() <= 127:
64
+ df[col] = df[col].astype(np.int8)
65
+ elif df[col].min() >= -32768 and df[col].max() <= 32767:
66
+ df[col] = df[col].astype(np.int16)
67
+ elif df[col].min() >= -2147483648 and df[col].max() <= 2147483647:
68
+ df[col] = df[col].astype(np.int32)
69
+ else:
70
+ df[col] = df[col].astype(np.int64)
71
+
72
+ # Convert float columns to float32 (usually sufficient precision)
73
+ for col in df.select_dtypes(include=['float']).columns:
74
+ df[col] = df[col].astype(np.float32)
75
+
76
+ # Categorical columns can be converted to 'category' dtype
77
+ for col in df.select_dtypes(include=['object']).columns:
78
+ if col != 'Date' and df[col].nunique() / len(df) < 0.5: # If it's not a date column and has less than 50% unique values
79
+ df[col] = df[col].astype('category')
80
+
81
+ return df
82
+
83
+ def load_dataset(self, name: str, file_path: str, optimize_memory: bool = True) -> pd.DataFrame:
84
+ """
85
+ Load a dataset from file with memory optimization
86
+
87
+ Parameters:
88
+ -----------
89
+ name : str
90
+ Name to identify the dataset
91
+ file_path : str
92
+ Path to the dataset file
93
+ optimize_memory : bool
94
+ Whether to optimize memory usage
95
+
96
+ Returns:
97
+ --------
98
+ pd.DataFrame
99
+ The loaded dataset
100
+ """
101
+ print(f"Loading dataset: {name} from {file_path}")
102
+
103
+ # Get column dtypes if specified in config
104
+ dtype_dict = self.data_config.get(name, {}).get('dtypes', None)
105
+
106
+ # Load with chunk size for large files to avoid memory issues
107
+ if file_path.endswith('.csv'):
108
+ try:
109
+ if dtype_dict:
110
+ df = pd.read_csv(file_path, dtype=dtype_dict)
111
+ else:
112
+ # For large files, read in chunks and concatenate
113
+ chunks = []
114
+ for chunk in pd.read_csv(file_path, chunksize=100000):
115
+ if optimize_memory:
116
+ chunk = self.optimize_dtypes(chunk)
117
+ chunks.append(chunk)
118
+
119
+ df = pd.concat(chunks, axis=0)
120
+ del chunks
121
+ gc.collect()
122
+ except Exception as e:
123
+ print(f"Error loading CSV: {e}")
124
+ return None
125
+ else:
126
+ print(f"Unsupported file format: {file_path}")
127
+ return None
128
+
129
+ # Convert date columns
130
+ if 'Date' in df.columns:
131
+ df['Date'] = pd.to_datetime(df['Date'])
132
+
133
+ # Optimize memory usage
134
+ if optimize_memory and dtype_dict is None:
135
+ df = self.optimize_dtypes(df)
136
+
137
+ # Store dataset
138
+ self.datasets[name] = df
139
+
140
+ print(f"Dataset {name} loaded: {df.shape} - Memory usage: {df.memory_usage().sum() / 1024**2:.2f} MB")
141
+ return df
142
+
143
+ def merge_datasets(self, datasets: List[str], on: List[str], how: str = 'inner') -> pd.DataFrame:
144
+ """
145
+ Merge multiple datasets
146
+
147
+ Parameters:
148
+ -----------
149
+ datasets : List[str]
150
+ List of dataset names to merge
151
+ on : List[str]
152
+ Columns to merge on
153
+ how : str
154
+ Type of merge to perform
155
+
156
+ Returns:
157
+ --------
158
+ pd.DataFrame
159
+ Merged dataset
160
+ """
161
+ if not datasets or len(datasets) < 2:
162
+ print("Need at least two datasets to merge")
163
+ return None
164
+
165
+ # Start with the first dataset
166
+ result = self.datasets[datasets[0]].copy()
167
+
168
+ # Merge with the rest
169
+ for i in range(1, len(datasets)):
170
+ result = result.merge(self.datasets[datasets[i]], on=on, how=how)
171
+ gc.collect() # Force garbage collection to free memory
172
+
173
+ print(f"Merged dataset shape: {result.shape} - Memory usage: {result.memory_usage().sum() / 1024**2:.2f} MB")
174
+ return result
175
+
176
+ def load_data(file_path='US_engineered_features.csv', optimize_memory=True):
177
+ """
178
+ Load and preprocess the COVID-19 data
179
+ """
180
+ # Load the data
181
+ print(f"Loading data from {file_path}...")
182
+
183
+ # For large files, read in chunks
184
+ if optimize_memory:
185
+ chunks = []
186
+ for chunk in pd.read_csv(file_path, chunksize=100000):
187
+ # Optimize dtypes for each chunk
188
+ # Convert integer columns to optimal integer type
189
+ for col in chunk.select_dtypes(include=['int']).columns:
190
+ if chunk[col].min() >= 0:
191
+ if chunk[col].max() <= 255:
192
+ chunk[col] = chunk[col].astype(np.uint8)
193
+ elif chunk[col].max() <= 65535:
194
+ chunk[col] = chunk[col].astype(np.uint16)
195
+ else:
196
+ chunk[col] = chunk[col].astype(np.uint32)
197
+ else:
198
+ if chunk[col].min() >= -128 and chunk[col].max() <= 127:
199
+ chunk[col] = chunk[col].astype(np.int8)
200
+ elif chunk[col].min() >= -32768 and chunk[col].max() <= 32767:
201
+ chunk[col] = chunk[col].astype(np.int16)
202
+ else:
203
+ chunk[col] = chunk[col].astype(np.int32)
204
+
205
+ # Convert float columns to float32
206
+ for col in chunk.select_dtypes(include=['float']).columns:
207
+ chunk[col] = chunk[col].astype(np.float32)
208
+
209
+ chunks.append(chunk)
210
+
211
+ df = pd.concat(chunks, axis=0)
212
+ del chunks
213
+ gc.collect() # Force garbage collection
214
+ else:
215
+ df = pd.read_csv(file_path)
216
+
217
+ # Convert Date to datetime
218
+ if 'Date' in df.columns:
219
+ df['Date'] = pd.to_datetime(df['Date'])
220
+
221
+ # Display basic information
222
+ print(f"Data shape: {df.shape}")
223
+ if 'Date' in df.columns:
224
+ print(f"Time period: {df['Date'].min()} to {df['Date'].max()}")
225
+
226
+ return df
227
+
228
+ def preprocess_data(df, target_column='New_Confirmed', prediction_days=7, test_size=0.2):
229
+ """
230
+ Preprocess the data for regression modeling
231
+
232
+ Parameters:
233
+ - df: DataFrame containing the COVID-19 data
234
+ - target_column: The column to predict
235
+ - prediction_days: Number of days ahead to predict
236
+ - test_size: Proportion of data to use for testing
237
+
238
+ Returns:
239
+ - X_train, X_test, y_train, y_test: Train and test sets
240
+ - feature_names: Names of the features used for prediction
241
+ - scaler: The fitted scaler for inverse transformations
242
+ """
243
+ # Convert Date to datetime if not already
244
+ if 'Date' in df.columns and not pd.api.types.is_datetime64_any_dtype(df['Date']):
245
+ df['Date'] = pd.to_datetime(df['Date'])
246
+
247
+ # Create a shifted target column for prediction
248
+ df[f'{target_column}_future_{prediction_days}d'] = df[target_column].shift(-prediction_days)
249
+
250
+ # Drop rows with NaN values (typically the last n rows where future data is not available)
251
+ df = df.dropna(subset=[f'{target_column}_future_{prediction_days}d'])
252
+
253
+ # Remove non-numeric columns and columns that would cause data leakage
254
+ non_feature_cols = ['Date', 'Country/Region', f'{target_column}_future_{prediction_days}d']
255
+ leakage_cols = [col for col in df.columns if 'future' in col and col != f'{target_column}_future_{prediction_days}d']
256
+
257
+ # For regression models, we'll use all available numeric features
258
+ features = df.columns.difference(non_feature_cols + leakage_cols)
259
+
260
+ # Select features and target
261
+ X = df[features].copy()  # copy so the fillna calls below do not trigger chained-assignment warnings
262
+ y = df[f'{target_column}_future_{prediction_days}d']
263
+
264
+ # Fill missing values with median for numerical features or mode for categorical
265
+ for col in X.columns:
266
+ if X[col].isna().sum() > 0:
267
+ if np.issubdtype(X[col].dtype, np.number):
268
+ X[col] = X[col].fillna(X[col].median())
269
+ else:
270
+ X[col] = X[col].fillna(X[col].mode()[0])
271
+
272
+ # Split data into training and testing sets
273
+ X_train, X_test, y_train, y_test = train_test_split(
274
+ X, y, test_size=test_size, shuffle=False
275
+ )
276
+
277
+ # Convert pandas DataFrames to NumPy arrays to save memory
278
+ X_train_np = X_train.values
279
+ X_test_np = X_test.values
280
+
281
+ # Scale the features
282
+ scaler = StandardScaler()
283
+ X_train_scaled = scaler.fit_transform(X_train_np)
284
+ X_test_scaled = scaler.transform(X_test_np)
285
+
286
+ # Release memory
287
+ del X_train_np, X_test_np
288
+ gc.collect()
289
+
290
+ print(f"Training set shape: {X_train.shape}")
291
+ print(f"Testing set shape: {X_test.shape}")
292
+ print(f"Features used: {len(features)}")
293
+
294
+ return X_train_scaled, X_test_scaled, y_train, y_test, list(features), scaler
295
+
296
+ def main():
297
+ """
298
+ Main function to demonstrate data preprocessing
299
+ """
300
+ # Define data configuration for memory optimization
301
+ data_config = {
302
+ 'covid_full': {
303
+ 'dtypes': {
304
+ 'Country/Region': 'category',
305
+ 'Confirmed': 'int32',
306
+ 'Deaths': 'int32',
307
+ 'Recovered': 'int32',
308
+ 'New_Confirmed': 'int32',
309
+ 'New_Deaths': 'int32',
310
+ 'New_Recovered': 'int32',
311
+ }
312
+ },
313
+ 'us_engineered': {
314
+ 'dtypes': {
315
+ 'Country/Region': 'category',
316
+ 'Confirmed': 'int32',
317
+ 'Deaths': 'int32',
318
+ 'New_Confirmed': 'int32',
319
+ 'New_Deaths': 'int32',
320
+ 'Day': 'int8',
321
+ 'Day_of_week': 'int8',
322
+ 'Month': 'int8',
323
+ 'Year': 'int16',
324
+ }
325
+ }
326
+ }
327
+
328
+ # Choose the approach:
329
+ # 1. Basic approach (backward compatible)
330
+ # 2. Advanced multi-dataset approach
331
+ approach = 2 # Change to 1 for basic approach
332
+
333
+ if approach == 1:
334
+ # Basic approach (legacy code)
335
+ print("Using basic approach...")
336
+ df = load_data(optimize_memory=True)
337
+ X_train, X_test, y_train, y_test, features, scaler = preprocess_data(df)
338
+ else:
339
+ # Advanced multi-dataset approach
340
+ print("Using advanced multi-dataset approach...")
341
+ processor = COVIDDataProcessor(data_config)
342
+
343
+ # Load datasets
344
+ processor.load_dataset('covid_full', 'covid_full_dataset.csv')
345
+ processor.load_dataset('us_engineered', 'US_engineered_features.csv')
346
+
347
+ # Example: You can also load other datasets and merge them
348
+ # processor.load_dataset('raw_confirmed', 'raw_confirmed.csv')
349
+ # processor.load_dataset('raw_deaths', 'raw_deaths.csv')
350
+ # processor.merge_datasets(['raw_confirmed', 'raw_deaths'], on=['Date', 'Country/Region'])
351
+
352
+ # For simplicity, we'll just use the US engineered features dataset
353
+ # But you could use any merged or single dataset here
354
+ X_train, X_test, y_train, y_test, features, scaler = preprocess_data(
355
+ processor.datasets['us_engineered']
356
+ )
357
+
358
+ print("\nPreprocessing complete!")
359
+ print(f"Number of training samples: {len(X_train)}")
360
+ print(f"Number of testing samples: {len(X_test)}")
361
+ print(f"Target range: {y_train.min()} to {y_train.max()}")
362
+
363
+ # Save the preprocessed data - using numpy's compressed format to save space
364
+ np.savez_compressed('X_train.npz', data=X_train)
365
+ np.savez_compressed('X_test.npz', data=X_test)
366
+ np.savez_compressed('y_train.npz', data=y_train)
367
+ np.savez_compressed('y_test.npz', data=y_test)
368
+
369
+ # Also save as .npy for backward compatibility
370
+ np.save('X_train.npy', X_train)
371
+ np.save('X_test.npy', X_test)
372
+ np.save('y_train.npy', y_train if isinstance(y_train, np.ndarray) else y_train.values)
373
+ np.save('y_test.npy', y_test if isinstance(y_test, np.ndarray) else y_test.values)
374
+
375
+ # Save features list
376
+ with open('features.txt', 'w') as f:
377
+ for feature in features:
378
+ f.write(f"{feature}\n")
379
+
380
+ # Save scaler
381
+ import joblib
382
+ joblib.dump(scaler, 'scaler.pkl')
383
+
384
+ print("Preprocessed data saved!")
385
+
386
+ # Print memory usage stats
387
+ import psutil
388
+ process = psutil.Process(os.getpid())
389
+ print(f"Current memory usage: {process.memory_info().rss / (1024 * 1024):.2f} MB")
390
+
391
+ if __name__ == "__main__":
392
+ main()
raw_confirmed.csv ADDED
The diff for this file is too large to render. See raw diff
 
raw_deaths.csv ADDED
The diff for this file is too large to render. See raw diff
 
raw_owid.csv ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b2c25df2d17be38533d88d9b714d80050f615519ab39f1ef5831ae38db7dff46
3
+ size 107886083
raw_recovered.csv ADDED
The diff for this file is too large to render. See raw diff
 
requirements.txt ADDED
@@ -0,0 +1,10 @@
1
+ pandas==2.1.0
2
+ numpy==1.26.0
3
+ matplotlib==3.8.0
4
+ seaborn==0.13.0
5
+ scikit-learn==1.3.0
6
+ xgboost==2.0.0
7
+ lightgbm==4.1.0
8
+ shap==0.43.0
9
+ gradio==4.10.0
10
+ joblib==1.3.2
run_pipeline.py ADDED
@@ -0,0 +1,93 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ COVID-19 Prediction Pipeline
4
+ ---------------------------
5
+ This script runs the complete pipeline from data preprocessing to model training
6
+ and launches the Gradio UI.
7
+ """
8
+
9
+ import os
10
+ import argparse
11
+ import subprocess
12
+ import time
13
+
14
+ def clear_screen():
15
+ """Clear the terminal screen"""
16
+ os.system('cls' if os.name == 'nt' else 'clear')
17
+
18
+ def run_command(command, description):
19
+ """Run a system command and print output"""
20
+ print(f"\n{'=' * 80}")
21
+ print(f"STEP: {description}")
22
+ print(f"{'=' * 80}\n")
23
+ print(f"Running: {command}\n")
24
+
25
+ # Run the command
26
+ start_time = time.time()
27
+ result = subprocess.run(command, shell=True)
28
+ end_time = time.time()
29
+
30
+ # Check if command was successful
31
+ if result.returncode == 0:
32
+ print(f"\nSuccess! Completed in {end_time - start_time:.2f} seconds")
33
+ else:
34
+ print(f"\nError! Command failed with exit code {result.returncode}")
35
+ exit(1)
36
+
37
+ print(f"\n{'=' * 80}\n")
38
+ time.sleep(1)
39
+
40
+ def parse_args():
41
+ """Parse command-line arguments"""
42
+ parser = argparse.ArgumentParser(description="COVID-19 Prediction Pipeline")
43
+ parser.add_argument("--skip-preprocessing", action="store_true", help="Skip data preprocessing")
44
+ parser.add_argument("--skip-training", action="store_true", help="Skip model training")
45
+ parser.add_argument("--only-ui", action="store_true", help="Only launch the Gradio UI")
46
+ return parser.parse_args()
47
+
48
+ def main():
49
+ """Run the complete pipeline"""
50
+ args = parse_args()
51
+
52
+ # Display welcome banner
53
+ clear_screen()
54
+ print("\n" + "=" * 80)
55
+ print("COVID-19 PREDICTION PIPELINE".center(80))
56
+ print("=" * 80 + "\n")
57
+ print("This script will run the complete pipeline:")
58
+ print("1. Data preprocessing")
59
+ print("2. Model training")
60
+ print("3. Launch Gradio UI for predictions")
61
+ print("\nPress Ctrl+C at any time to stop the pipeline.")
62
+ print()
63
+
64
+ try:
65
+ # Step 1: Data Preprocessing
66
+ if args.only_ui:
67
+ print("Skipping preprocessing and training, launching UI only...")
68
+ else:
69
+ if not args.skip_preprocessing:
70
+ run_command("python preprocess_data.py", "Data Preprocessing")
71
+ else:
72
+ print("Skipping preprocessing as requested.")
73
+
74
+ # Step 2: Model Training
75
+ if not args.skip_training:
76
+ run_command("python train_models.py", "Model Training")
77
+ else:
78
+ print("Skipping model training as requested.")
79
+
80
+ # Step 3: Launch Gradio UI
81
+ print("\nLaunching Gradio UI for predictions...")
82
+ run_command("python gradio_app.py", "Gradio UI Launch")
83
+
84
+ except KeyboardInterrupt:
85
+ print("\n\nPipeline interrupted by user. Exiting.")
86
+ exit(0)
87
+
88
+ except Exception as e:
89
+ print(f"\n\nError in pipeline: {str(e)}")
90
+ exit(1)
91
+
92
+ if __name__ == "__main__":
93
+ main()
train_models.py ADDED
@@ -0,0 +1,165 @@
1
+ import numpy as np
2
+ import pandas as pd
3
+ from sklearn.linear_model import LinearRegression
4
+ from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
5
+ from sklearn.svm import SVR
6
+ from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
7
+ import matplotlib.pyplot as plt
8
+ import seaborn as sns
9
+ import joblib
10
+ import os
11
+ import gc
12
+ import psutil
13
+ from typing import Dict, List, Tuple
14
+
15
+ # Set the style for plots
16
+ sns.set(style="whitegrid")
17
+
18
+ # Set up memory monitoring
19
+ def print_memory_usage():
20
+ process = psutil.Process(os.getpid())
21
+ memory_usage = process.memory_info().rss / (1024 * 1024) # Convert to MB
22
+ print(f"Current memory usage: {memory_usage:.2f} MB")
23
+
24
+ def train_and_evaluate_models(X_train, X_test, y_train, y_test, feature_names=None):
25
+ """
26
+ Train and evaluate multiple regression models for COVID-19 prediction
27
+
28
+ Parameters:
29
+ - X_train, X_test, y_train, y_test: Training and testing data
30
+ - feature_names: List of feature names (for feature importance)
31
+
32
+ Returns:
33
+ - models: Dictionary of trained models
34
+ - metrics: Dictionary of evaluation metrics for each model
35
+ """
36
+ models = {
37
+ 'Linear Regression': LinearRegression(),
38
+ 'Support Vector Regression': SVR(kernel='rbf', gamma='scale', C=1.0, epsilon=0.1),
39
+ 'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
40
+ 'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
41
+ }
42
+
43
+ metrics = {
44
+ 'Model': [],
45
+ 'RMSE': [],
46
+ 'MAE': [],
47
+ 'R²': []
48
+ }
49
+
50
+ for name, model in models.items():
51
+ print(f"Training {name}...")
52
+ model.fit(X_train, y_train)
53
+
54
+ # Predict
55
+ y_pred = model.predict(X_test)
56
+
57
+ # Calculate metrics
58
+ rmse = np.sqrt(mean_squared_error(y_test, y_pred))
59
+ mae = mean_absolute_error(y_test, y_pred)
60
+ r2 = r2_score(y_test, y_pred)
61
+
62
+ # Store metrics
63
+ metrics['Model'].append(name)
64
+ metrics['RMSE'].append(rmse)
65
+ metrics['MAE'].append(mae)
66
+ metrics['R²'].append(r2)
67
+
68
+ print(f"{name} - RMSE: {rmse:.2f}, MAE: {mae:.2f}, R²: {r2:.4f}")
69
+
70
+ # Save the model
71
+ joblib.dump(model, f'{name.replace(" ", "_").lower()}_model.pkl')
72
+
73
+ # Plot actual vs predicted
74
+ plt.figure(figsize=(10, 6))
75
+ plt.scatter(y_test, y_pred, alpha=0.5)
76
+ plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
77
+ plt.title(f'{name} - Actual vs Predicted')
78
+ plt.xlabel('Actual')
79
+ plt.ylabel('Predicted')
80
+ plt.savefig(f'{name.replace(" ", "_").lower()}_predictions.png')
+ plt.close()  # free the figure to keep memory usage low
81
+
82
+ # If it's Random Forest or Gradient Boosting, plot feature importance
83
+ if name in ['Random Forest', 'Gradient Boosting'] and feature_names is not None:
84
+ plt.figure(figsize=(12, 8))
85
+ feature_importance = model.feature_importances_
86
+ sorted_idx = np.argsort(feature_importance)
87
+
88
+ # Select top 15 features for better visualization
89
+ top_k = min(15, len(feature_importance))
90
+ plt.barh(range(top_k), feature_importance[sorted_idx][-top_k:])
91
+ plt.yticks(range(top_k), [feature_names[i] for i in sorted_idx[-top_k:]])
92
+ plt.title(f'{name} - Top {top_k} Feature Importance')
93
+ plt.tight_layout()
94
+ plt.savefig(f'{name.replace(" ", "_").lower()}_feature_importance.png')
+ plt.close()  # free the figure to keep memory usage low
95
+
96
+ # Plot comparison of models
97
+ metrics_df = pd.DataFrame(metrics)
98
+
99
+ # Create bar plot for RMSE and MAE
100
+ plt.figure(figsize=(12, 6))
101
+
102
+ bar_width = 0.35
103
+ index = np.arange(len(metrics_df['Model']))
104
+
105
+ plt.bar(index, metrics_df['RMSE'], bar_width, label='RMSE')
106
+ plt.bar(index + bar_width, metrics_df['MAE'], bar_width, label='MAE')
107
+
108
+ plt.xlabel('Model')
109
+ plt.ylabel('Error')
110
+ plt.title('Model Comparison - RMSE and MAE')
111
+ plt.xticks(index + bar_width / 2, metrics_df['Model'], rotation=45)
112
+ plt.legend()
113
+ plt.tight_layout()
114
+ plt.savefig('model_comparison_error.png')
115
+
116
+ # Create bar plot for R²
117
+ plt.figure(figsize=(12, 6))
118
+ plt.bar(metrics_df['Model'], metrics_df['R²'], color='skyblue')
119
+ plt.xlabel('Model')
120
+ plt.ylabel('R²')
121
+ plt.title('Model Comparison - R²')
122
+ plt.xticks(rotation=45)
123
+ plt.tight_layout()
124
+ plt.savefig('model_comparison_r2.png')
125
+
126
+ print("\nModel training and evaluation complete!")
127
+ print(f"Models saved as: {', '.join([f'{name.replace(' ', '_').lower()}_model.pkl' for name in models.keys()])}")
128
+
129
+ return models, metrics_df
130
+
131
+ def main():
132
+ """
133
+ Main function to train and evaluate models
134
+ """
135
+ # Check if preprocessed data exists
136
+ if not all(os.path.exists(f) for f in ['X_train.npy', 'X_test.npy', 'y_train.npy', 'y_test.npy']):
137
+ print("Preprocessed data not found. Please run preprocess_data.py first.")
138
+ return
139
+
140
+ # Load preprocessed data
141
+ X_train = np.load('X_train.npy')
142
+ X_test = np.load('X_test.npy')
143
+ y_train = np.load('y_train.npy')
144
+ y_test = np.load('y_test.npy')
145
+
146
+ # Load feature names
147
+ feature_names = []
148
+ if os.path.exists('features.txt'):
149
+ with open('features.txt', 'r') as f:
150
+ feature_names = [line.strip() for line in f.readlines()]
151
+
152
+ print("Data loaded successfully!")
153
+ print(f"Training data shape: {X_train.shape}")
154
+ print(f"Testing data shape: {X_test.shape}")
155
+
156
+ # Train and evaluate models
157
+ models, metrics = train_and_evaluate_models(X_train, X_test, y_train, y_test, feature_names)
158
+
159
+ # Display and save comparison table
160
+ print("\nModel Comparison:")
161
+ print(metrics)
162
+ metrics.to_csv('model_comparison.csv', index=False)
163
+
164
+ if __name__ == "__main__":
165
+ main()