metadata
title: AutoML
emoji: 🦀
colorFrom: blue
colorTo: pink
sdk: streamlit
sdk_version: 1.44.0
app_file: app.py
pinned: true
license: mit
short_description: Automated Machine Learning platform
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/66c623e4c36beb1532189397/Hp59Si4oWEY4X4D95ZPRU.png

AutoML - Automated Machine Learning Platform

Badges: MIT License · Python · Streamlit · Scikit-Learn · Pandas · NumPy · Matplotlib · Seaborn · Plotly · XGBoost · LangChain · LangSmith (monitoring) · Google Gemini · Groq · python-dotenv · pickle

AutoML is a powerful tool for automating the end-to-end process of applying machine learning to real-world problems. It simplifies model selection, hyperparameter tuning, and model export, making machine learning accessible to everyone.

🔗 Live Demo

Try the Demo

Check out the live demo of AutoML and experience the power of automated machine learning firsthand!

🎬 Video Showcase

AutoML Demonstration

See AutoML in action: This demonstration shows how to analyze data, train models, and get AI-powered insights in minutes!

✨ Features

  • 📊 Data Visualization and Analysis: Interactive visualizations to understand your data

    • Correlation heatmaps
    • Distribution plots
    • Feature importance charts
    • Pair plots for relationship analysis
  • 🧹 Automated Data Cleaning and Preprocessing: Handle missing values, outliers, and feature engineering

    • Automatic detection and handling of missing values
    • Outlier detection and treatment
    • Feature scaling and normalization
    • Categorical encoding (One-Hot, Label, Target encoding)
  • 🤖 Multiple ML Model Selection: Choose from a variety of models or let AutoML select the best one

    • Classification models: Logistic Regression, Random Forest, XGBoost, SVC, Decision Tree, KNN, Gradient Boosting, AdaBoost, Gaussian Naive Bayes, QDA, LDA
    • Regression models: Linear Regression, Random Forest, XGBoost, SVR, Decision Tree, KNN, ElasticNet, Gradient Boosting, AdaBoost, Bayesian Ridge, Ridge, Lasso
  • โš™๏ธ Hyperparameter Tuning: Optimize model performance with advanced tuning techniques

    • Added Support for 20+ Models to easily fine tune hyperparameters
    • Added Support for 10+ Hyperparameter Tuning Techniques
  • 📈 Model Performance Evaluation: Comprehensive metrics and visualizations

    • Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC, Confusion Matrix (see the metrics sketch after this list)
    • Regression: MAE, MSE, RMSE, R², Residual Plots
  • ๐Ÿ” AI-powered Data Insights: Leverage Google's Gemini for intelligent data analysis

    • Natural language explanations of model decisions
    • Automated feature importance interpretation
    • Data quality assessment
    • Trend identification and anomaly detection
  • 🧠 LLM Fine-Tuning and Download: Access and utilize pre-trained language models

    • Download fine-tuned LLMs for specific domains
    • Customize existing models for your specific use case
    • Access to various model sizes (small, medium, large)
    • Seamless integration with your data processing pipeline
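
As an illustration of the classification metrics listed above, a minimal scikit-learn snippet for a binary task might look like the following (model, X_test, and y_test are placeholders for the objects the app produces during training):

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# model, X_test, y_test are assumed outputs of the training step (binary task)
y_pred = model.predict(X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))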

🚀 Installation

Prerequisites

  • Python 3.8 or higher
  • Google API key (Gemini) for data insights and DataFrame cleaning
  • Groq API key for LLM-based analysis of test results
  • LangSmith API key for monitoring LLM calls

Setup

  1. Clone the repository:
git clone <repository-url>
cd Auto-ML
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up your environment variables:
# Create a .env file with your Google API key as well as other keys
echo "GOOGLE_API_KEY=your_api_key_here" > .env

🎮 Usage

Start the application:

streamlit run app.py

Quick Start Guide

  1. Upload Data: Upload your CSV file

    • Supported format: CSV
    • Automatic data type detection
    • Preview of first few rows
  2. Explore Data: Visualize and understand your dataset

    • Summary statistics
    • Correlation analysis
    • Distribution visualization
    • Missing value analysis
  3. Preprocess: Clean and transform your data

    • Handle missing values (imputation strategies)
    • Remove or transform outliers
    • Feature scaling options
    • Encoding categorical variables
  4. Train Models: Select models and tune hyperparameters

    • Choose target variable and features
    • Select machine learning algorithms
    • Configure hyperparameter search space
    • Set evaluation metrics
  5. Evaluate: Compare model performance

    • Performance metrics visualization
    • Feature importance analysis
    • Model comparison dashboard
    • Cross-validation results
  6. Deploy: Export your model

    • Download the trained model as a pickle file (see the loading sketch below)
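
Once downloaded, the pickled model can be reused outside the app. A minimal sketch, assuming the file is named best_model.pkl and the new data has already been preprocessed the same way as the training data:

import pickle

import pandas as pd

# Load the model exported from the AutoML app (file name is illustrative)
with open("best_model.pkl", "rb") as f:
    model = pickle.load(f)

# Score new rows with the same columns/preprocessing as the training data
new_data = pd.read_csv("new_data.csv")
print(model.predict(new_data)[:10])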

🧩 Project Structure

Auto-ML/
├── app.py                  # Main Streamlit application
├── requirements.txt        # Project dependencies
├── .env                    # Environment variables (API keys)
├── README.md               # Project documentation
├── models/                 # Saved model files
├── logs/                   # Application logs
└── src/                    # Source code
    ├── __init__.py         # Package initialization
    ├── preprocessing/      # Data preprocessing modules
    │   ├── __init__.py
    │   └── ...             # Data cleaning, transformation
    ├── training/           # Model training modules
    │   ├── __init__.py
    │   └── ...             # Model training, evaluation
    ├── ui/                 # User interface components
    │   ├── __init__.py
    │   └── ...             # Streamlit UI elements
    └── utils/              # Utility functions
        ├── __init__.py
        └── ...             # Helper functions

Pipelines

1. Data Ingestion Pipeline

Purpose: Collects raw data from multiple sources (CSV, databases, APIs).

  • Reads structured/unstructured data
  • Handles missing values and duplicates
  • Converts raw data into a clean DataFrame
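
A minimal sketch of the CSV path with pandas (the file name and cleaning steps are illustrative):

import pandas as pd

# Load a CSV, drop duplicate rows, and report missing values per column
df = pd.read_csv("data.csv")
df = df.drop_duplicates()
print(df.isna().sum())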

2. Data Cleaning & Preprocessing Pipeline

Purpose: Transforms raw data into a machine-learning-ready format.

  • Cleans Data: Handles NaNs, outliers, and standardizes columns
  • Encodes Categorical Features: One-hot encoding, label encoding
  • Scales Numerical Data: MinMaxScaler, StandardScaler
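
A condensed sketch of such a pipeline with scikit-learn (the column lists are placeholders; the app detects them from the uploaded DataFrame):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]        # placeholder column names
categorical_cols = ["gender", "city"]   # placeholder column names

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

X_clean = preprocess.fit_transform(df)  # df: the ingested DataFrame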

3. Model Selection & Training Pipeline

Purpose: Automates model selection and training.

  • Multiple Algorithms: Trains XGBoost, Random Forest, and deep learning models
  • Hyperparameter Optimization: Finds the best configuration for each model
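
In spirit, the selection step compares candidate models on a common metric and keeps the best one. A simplified sketch (X and y are the preprocessed features and target; the candidate list and scoring metric are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# X, y are assumed outputs of the preprocessing pipeline
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}

scores = {name: cross_val_score(model, X, y, cv=5, scoring="f1_weighted").mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)
print(f"Best model: {best_name} (CV F1 = {scores[best_name]:.3f})")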

4. Model Deployment Pipeline

Purpose: Makes the model available for real-world usage.

  • Exports the Model (Pickle, ONNX, TensorFlow SavedModel)
  • Easy download after training (see the export sketch below)
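
A minimal export sketch using pickle and Streamlit's download button (best_model is the trained estimator from the previous step):

import pickle

import streamlit as st

# Serialize the trained model and offer it as a file download in the UI
model_bytes = pickle.dumps(best_model)
st.download_button(
    label="Download trained model",
    data=model_bytes,
    file_name="best_model.pkl",
    mime="application/octet-stream",
)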

Feedback and Fallback Mechanism

AutoML implements a robust feedback and fallback system to ensure reliability:

  1. Data Cleaning Validation: The system validates all cleaning operations and provides feedback on the changes made

    • Automatic detection of cleaning effectiveness
    • Detailed logs of transformations applied to the data
  2. LLM Fallback Mechanism: For AI-powered insights and data analysis

    • Primary attempt uses advanced LLMs (Google Gemini/Groq)
    • Automatic fallback to rule-based algorithms if the LLM fails (see the sketch after this list)
    • Graceful degradation to ensure core functionality remains available
    • Error logging and reporting for continuous improvement
    • LangSmith integration for monitoring and tracking all LLM calls
  3. Error Feedback Loop: Intelligent error handling during data cleaning

    • Automatically captures errors that occur during data cleaning operations
    • Sends error context to LLM to generate refined cleaning code
    • Re-executes the improved cleaning process
    • Iterative refinement ensures robust data preparation even with challenging datasets
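
The fallback pattern boils down to: try the primary LLM, and degrade to a deterministic path on any failure. A hedged sketch (the model name, prompt, and fallback logic are illustrative, not the app's exact implementation):

import logging

from langchain_google_genai import ChatGoogleGenerativeAI

def rule_based_summary(df_summary: str) -> str:
    # Stand-in for the real rule-based analysis used when the LLM is unavailable
    return "LLM insights unavailable; raw dataset profile:\n" + df_summary

def insights_with_fallback(df_summary: str) -> str:
    """Ask the primary LLM for insights; fall back to rule-based output on any error."""
    try:
        llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")  # model name is an example
        return llm.invoke(f"Summarize this dataset profile:\n{df_summary}").content
    except Exception as exc:  # auth, quota, or network errors
        logging.warning("LLM call failed, falling back to rule-based summary: %s", exc)
        return rule_based_summary(df_summary)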

๐Ÿค Contributing

We welcome contributions!

Development Setup

  1. Fork the repository
  2. Create a feature branch
  3. Install development dependencies:
    pip install -r requirements-dev.txt
    
  4. Make your changes
  5. Run tests:
    pytest
    
  6. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgements


Made with โค๏ธ by Akash Anandani