
COVID-19 Prediction Model

This project implements a COVID-19 prediction system that compares four regression models, with Random Forest as the primary model. The system includes a Gradio user interface for deployment to Hugging Face Spaces.

Features

  • Memory-optimized data processing that can handle multiple datasets with differing data types, including object-typed columns
  • Multiple regression models for comparison:
    • Random Forest Regression
    • Linear Regression
    • Support Vector Regression (SVR)
    • Gradient Boosting Regression
  • Gradio UI for easy model selection, visualization, and deployment to Hugging Face Spaces
  • Complete data preprocessing pipeline with feature engineering
  • Performance evaluation metrics and visualization

Project Structure

COVID-19-Prediction/
├── covid_full_dataset.csv       # Complete COVID-19 dataset
├── US_engineered_features.csv   # Engineered features for US data
├── raw_confirmed.csv            # Raw confirmed cases data
├── raw_deaths.csv               # Raw deaths data
├── raw_recovered.csv            # Raw recovered cases data
├── raw_owid.csv                 # Additional data from Our World in Data
├── covid_country_ts.csv         # Country-level time series data
├── preprocess_data.py           # Data preprocessing script
├── train_models.py              # Model training script
├── gradio_app.py                # Gradio UI for predictions
├── run_pipeline.py              # Complete pipeline runner
└── requirements.txt             # Project dependencies

Installation

  1. Clone this repository:

    git clone https://github.com/yourusername/covid19-prediction.git
    cd covid19-prediction
    
  2. Install the required packages:

    pip install -r requirements.txt
    

Usage

Run the Complete Pipeline

To run the complete pipeline (preprocessing, training, and UI):

python run_pipeline.py

Pipeline Options

  • Skip preprocessing: python run_pipeline.py --skip-preprocessing
  • Skip training: python run_pipeline.py --skip-training
  • Only launch UI: python run_pipeline.py --only-ui
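
For readers who want to see how these flags could map to pipeline steps, here is a minimal sketch using `argparse`. The flag names come from the list above; the function names `parse_args` and `planned_steps`, and the step labels, are hypothetical and may not match what `run_pipeline.py` actually does.

```python
import argparse


def parse_args(argv=None):
    """Parse the three pipeline flags listed above (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="COVID-19 prediction pipeline")
    parser.add_argument("--skip-preprocessing", action="store_true",
                        help="reuse previously preprocessed data")
    parser.add_argument("--skip-training", action="store_true",
                        help="reuse previously trained models")
    parser.add_argument("--only-ui", action="store_true",
                        help="launch the Gradio UI and nothing else")
    return parser.parse_args(argv)


def planned_steps(args):
    """Return the ordered list of steps selected by the flags."""
    if args.only_ui:
        return ["ui"]
    steps = []
    if not args.skip_preprocessing:
        steps.append("preprocess")
    if not args.skip_training:
        steps.append("train")
    steps.append("ui")  # the UI is always the final step in this sketch
    return steps
```

In this sketch, `planned_steps(parse_args(["--skip-training"]))` selects preprocessing followed by the UI, and `--only-ui` overrides the other two flags.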

Run Individual Steps

  1. Data Preprocessing:

    python preprocess_data.py
    
  2. Model Training:

    python train_models.py
    
  3. Launch Gradio UI:

    python gradio_app.py
    

Memory Optimization

This project is optimized to handle large datasets efficiently:

  • Uses appropriate data types to minimize memory footprint
  • Processes data in chunks for large files
  • Employs garbage collection to free memory
  • Uses compressed NumPy formats for storing processed data
  • Optimizes model parameters for memory efficiency
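
The techniques above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `preprocess_data.py`: the helper names `downcast_numeric`, `load_in_chunks`, and `save_arrays`, the chunk size, and the array keys are all hypothetical.

```python
import gc

import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Shrink numeric columns to the smallest dtype that holds their values."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 halves memory but reduces precision
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


def load_in_chunks(source, chunksize=50_000):
    """Read a CSV piece by piece, downcasting each chunk before combining."""
    chunks = [downcast_numeric(chunk)
              for chunk in pd.read_csv(source, chunksize=chunksize)]
    df = pd.concat(chunks, ignore_index=True)
    del chunks
    gc.collect()  # release the per-chunk frames promptly
    return df


def save_arrays(target_file, features, target):
    """Store processed arrays in NumPy's compressed .npz format."""
    np.savez_compressed(target_file, features=features, target=target)
```

A call such as `load_in_chunks("covid_full_dataset.csv")` would read the file from the project structure in chunks. The arrays written by `save_arrays` can be read back with `np.load`.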

Models

The project implements and compares four regression models:

  1. Random Forest Regressor: An ensemble learning method that builds multiple decision trees and averages their predictions.
  2. Linear Regression: A simple baseline model that assumes a linear relationship between features and target.
  3. Support Vector Regression (SVR): Fits a regression function defined by a subset of the training points (the support vectors); with a non-linear kernel it can capture non-linear relationships.
  4. Gradient Boosting Regressor: An ensemble technique that builds trees sequentially, with each tree correcting errors made by previous ones.
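
As a rough illustration of how the four models could be compared with scikit-learn, the sketch below trains each one on synthetic data. The hyperparameters, the synthetic features, the choice of MAE and R² as metrics, and the helper names `build_models` and `compare_models` are assumptions for the example; the settings in `train_models.py` may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def build_models(random_state=42):
    """Return the four regressors keyed by display name (assumed settings)."""
    return {
        "Random Forest": RandomForestRegressor(n_estimators=50,
                                               random_state=random_state),
        "Linear Regression": LinearRegression(),
        # SVR is sensitive to feature scale, so it is wrapped with a scaler
        "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
        "Gradient Boosting": GradientBoostingRegressor(random_state=random_state),
    }


def compare_models(X, y, random_state=42):
    """Fit every model on the same split and collect MAE and R^2 scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state)
    scores = {}
    for name, model in build_models(random_state).items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[name] = {"mae": mean_absolute_error(y_test, pred),
                        "r2": r2_score(y_test, pred)}
    return scores


# Synthetic stand-in for the engineered features (not real COVID-19 data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
scores = compare_models(X, y)
```

The random split keeps the example short. For time series data such as daily case counts, a chronological split is usually a safer choice because it avoids training on dates later than the test period.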

Hugging Face Deployment

The Gradio interface is configured for easy deployment to Hugging Face Spaces:

  1. Create a new Space on Hugging Face
  2. Upload all files to the Space
  3. The app detects the Hugging Face environment and configures itself automatically
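
One way an app can adapt to the Hugging Face environment is to check the `SPACE_ID` environment variable, which Spaces sets for running apps. The sketch below is an assumption about how that could look, not the logic in `gradio_app.py`; the helper name `launch_settings` and the chosen values are hypothetical.

```python
import os


def launch_settings(env=None):
    """Pick keyword arguments for Gradio's launch() based on the environment."""
    env = os.environ if env is None else env
    if env.get("SPACE_ID"):
        # Running on a Space: listen on all interfaces on Gradio's default port
        return {"server_name": "0.0.0.0", "server_port": 7860}
    # Local run: bind to localhost and do not create a public share link
    return {"server_name": "127.0.0.1", "share": False}
```

With a Gradio `Blocks` or `Interface` object named `demo`, this could be used as `demo.launch(**launch_settings())`.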

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Data sources: Johns Hopkins CSSE, Our World in Data
  • Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, Gradio