metadata
title: AutoML
emoji: 🦀
colorFrom: blue
colorTo: pink
sdk: streamlit
sdk_version: 1.44.0
app_file: app.py
pinned: true
license: mit
short_description: Automated Machine Learning platform
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/66c623e4c36beb1532189397/Hp59Si4oWEY4X4D95ZPRU.png

AutoML - Automated Machine Learning Platform

Badges: MIT License · Python · Streamlit · Scikit-Learn · Pandas · NumPy · Matplotlib · Seaborn · Plotly · XGBoost · LangChain · LangSmith (monitoring) · Google Gemini · Groq · python-dotenv · pickle

AutoML is a powerful tool for automating the end-to-end process of applying machine learning to real-world problems. It simplifies model selection, hyperparameter tuning, and model export, making machine learning accessible to everyone.

🔗 Live Demo

Try the Demo

Check out the live demo of AutoML and experience the power of automated machine learning firsthand!

🎬 Video Showcase

AutoML Demonstration

See AutoML in action: This demonstration shows how to analyze data, train models, and get AI-powered insights in minutes!

✨ Features

  • 📊 Data Visualization and Analysis: Interactive visualizations to understand your data

    • Correlation heatmaps
    • Distribution plots
    • Feature importance charts
    • Pair plots for relationship analysis
  • 🧹 Automated Data Cleaning and Preprocessing: Handle missing values, outliers, and feature engineering

    • Automatic detection and handling of missing values
    • Outlier detection and treatment
    • Feature scaling and normalization
    • Categorical encoding (One-Hot, Label, Target encoding)
  • 🤖 Multiple ML Model Selection: Choose from a variety of models or let AutoML select the best one

    • Classification models: Logistic Regression, Random Forest, XGBoost, SVC, Decision Tree, KNN, Gradient Boosting, AdaBoost, Gaussian Naive Bayes, QDA, LDA
    • Regression models: Linear Regression, Random Forest, XGBoost, SVR, Decision Tree, KNN, ElasticNet, Gradient Boosting, AdaBoost, Bayesian Ridge, Ridge, Lasso
  • โš™๏ธ Hyperparameter Tuning: Optimize model performance with advanced tuning techniques

    • Added Support for 20+ Models to easily fine tune hyperparameters
    • Added Support for 10+ Hyperparameter Tuning Techniques
  • 📈 Model Performance Evaluation: Comprehensive metrics and visualizations

    • Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC, Confusion Matrix (see the metrics sketch after this list)
    • Regression: MAE, MSE, RMSE, R², Residual Plots
  • ๐Ÿ” AI-powered Data Insights: Leverage Google's Gemini for intelligent data analysis

    • Natural language explanations of model decisions
    • Automated feature importance interpretation
    • Data quality assessment
    • Trend identification and anomaly detection
  • 🧠 LLM Fine-Tuning and Download: Access and utilize pre-trained language models

    • Download fine-tuned LLMs for specific domains
    • Customize existing models for your specific use case
    • Access to various model sizes (small, medium, large)
    • Seamless integration with your data processing pipeline
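
As an illustration of the classification metrics listed above, a minimal scikit-learn snippet for a binary task might look like the following (model, X_test, and y_test are placeholders for the objects the app produces during training):

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# model, X_test, y_test are assumed outputs of the training step (binary task)
y_pred = model.predict(X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))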

🚀 Installation

Prerequisites

  • Python 3.8 or higher
  • Google API key (Gemini) for data insights and DataFrame cleaning
  • Groq API key for LLM-based analysis of test results
  • LangSmith API key for monitoring LLM calls

Setup

  1. Clone the repository:
git clone <repository-url>
cd Auto-ML
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up your environment variables:
# Create a .env file with your Google API key as well as other keys
echo "GOOGLE_API_KEY=your_api_key_here" > .env

🎮 Usage

Start the application:

streamlit run app.py

Quick Start Guide

  1. Upload Data: Upload your CSV file

    • Supported format: CSV
    • Automatic data type detection
    • Preview of first few rows
  2. Explore Data: Visualize and understand your dataset

    • Summary statistics
    • Correlation analysis
    • Distribution visualization
    • Missing value analysis
  3. Preprocess: Clean and transform your data

    • Handle missing values (imputation strategies)
    • Remove or transform outliers
    • Feature scaling options
    • Encoding categorical variables
  4. Train Models: Select models and tune hyperparameters

    • Choose target variable and features
    • Select machine learning algorithms
    • Configure hyperparameter search space
    • Set evaluation metrics
  5. Evaluate: Compare model performance

    • Performance metrics visualization
    • Feature importance analysis
    • Model comparison dashboard
    • Cross-validation results
  6. Deploy: Export your model

    • Download the trained model as a pickle file (see the loading sketch below)
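
Once downloaded, the pickled model can be reused outside the app. A minimal sketch, assuming the file is named best_model.pkl and the new data has already been preprocessed the same way as the training data:

import pickle

import pandas as pd

# Load the model exported from the AutoML app (file name is illustrative)
with open("best_model.pkl", "rb") as f:
    model = pickle.load(f)

# Score new rows with the same columns/preprocessing as the training data
new_data = pd.read_csv("new_data.csv")
print(model.predict(new_data)[:10])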

🧩 Project Structure

Auto-ML/
├── app.py                  # Main Streamlit application
├── requirements.txt        # Project dependencies
├── .env                    # Environment variables (API keys)
├── README.md               # Project documentation
├── models/                 # Saved model files
├── logs/                   # Application logs
└── src/                    # Source code
    ├── __init__.py         # Package initialization
    ├── preprocessing/      # Data preprocessing modules
    │   ├── __init__.py
    │   └── ...             # Data cleaning, transformation
    ├── training/           # Model training modules
    │   ├── __init__.py
    │   └── ...             # Model training, evaluation
    ├── ui/                 # User interface components
    │   ├── __init__.py
    │   └── ...             # Streamlit UI elements
    └── utils/              # Utility functions
        ├── __init__.py
        └── ...             # Helper functions

Pipelines

1. Data Ingestion Pipeline

Purpose: Collects raw data from multiple sources (CSV, databases, APIs).

  • Reads structured/unstructured data
  • Handles missing values and duplicates
  • Converts raw data into a clean DataFrame
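
A minimal sketch of the CSV path with pandas (the file name and cleaning steps are illustrative):

import pandas as pd

# Load a CSV, drop duplicate rows, and report missing values per column
df = pd.read_csv("data.csv")
df = df.drop_duplicates()
print(df.isna().sum())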

2. Data Cleaning & Preprocessing Pipeline

Purpose: Transforms raw data into a machine-learning-ready format.

  • Cleans Data: Handles NaNs, outliers, and standardizes columns
  • Encodes Categorical Features: One-hot encoding, label encoding
  • Scales Numerical Data: MinMaxScaler, StandardScaler
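
A condensed sketch of such a pipeline with scikit-learn (the column lists are placeholders; the app detects them from the uploaded DataFrame):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]        # placeholder column names
categorical_cols = ["gender", "city"]   # placeholder column names

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

X_clean = preprocess.fit_transform(df)  # df: the ingested DataFrame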

3. Model Selection & Training Pipeline

Purpose: Automates model selection and training.

  • Multiple Algorithms: Trains XGBoost, Random Forest, and deep learning models
  • Hyperparameter Optimization: Finds the best configuration for each model
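
In spirit, the selection step compares candidate models on a common metric and keeps the best one. A simplified sketch (X and y are the preprocessed features and target; the candidate list and scoring metric are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# X, y are assumed outputs of the preprocessing pipeline
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}

scores = {name: cross_val_score(model, X, y, cv=5, scoring="f1_weighted").mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)
print(f"Best model: {best_name} (CV F1 = {scores[best_name]:.3f})")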

4. Model Deployment Pipeline

Purpose: Makes the model available for real-world usage.

  • Exports the Model (Pickle, ONNX, TensorFlow SavedModel)
  • Easy download after training (see the export sketch below)
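
A minimal export sketch using pickle and Streamlit's download button (best_model is the trained estimator from the previous step):

import pickle

import streamlit as st

# Serialize the trained model and offer it as a file download in the UI
model_bytes = pickle.dumps(best_model)
st.download_button(
    label="Download trained model",
    data=model_bytes,
    file_name="best_model.pkl",
    mime="application/octet-stream",
)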

Feedback and Fallback Mechanism

AutoML implements a robust feedback and fallback system to ensure reliability:

  1. Data Cleaning Validation: The system validates all cleaning operations and provides feedback on the changes made

    • Automatic detection of cleaning effectiveness
    • Detailed logs of transformations applied to the data
  2. LLM Fallback Mechanism: For AI-powered insights and data analysis

    • Primary attempt uses advanced LLMs (Google Gemini/Groq)
    • Automatic fallback to rule-based algorithms if the LLM fails (see the sketch after this list)
    • Graceful degradation to ensure core functionality remains available
    • Error logging and reporting for continuous improvement
    • LangSmith integration for monitoring and tracking all LLM calls
  3. Error Feedback Loop: Intelligent error handling during data cleaning

    • Automatically captures errors that occur during data cleaning operations
    • Sends error context to LLM to generate refined cleaning code
    • Re-executes the improved cleaning process
    • Iterative refinement ensures robust data preparation even with challenging datasets
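
The fallback pattern boils down to: try the primary LLM, and degrade to a deterministic path on any failure. A hedged sketch (the model name, prompt, and fallback logic are illustrative, not the app's exact implementation):

import logging

from langchain_google_genai import ChatGoogleGenerativeAI

def rule_based_summary(df_summary: str) -> str:
    # Stand-in for the real rule-based analysis used when the LLM is unavailable
    return "LLM insights unavailable; raw dataset profile:\n" + df_summary

def insights_with_fallback(df_summary: str) -> str:
    """Ask the primary LLM for insights; fall back to rule-based output on any error."""
    try:
        llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")  # model name is an example
        return llm.invoke(f"Summarize this dataset profile:\n{df_summary}").content
    except Exception as exc:  # auth, quota, or network errors
        logging.warning("LLM call failed, falling back to rule-based summary: %s", exc)
        return rule_based_summary(df_summary)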

๐Ÿค Contributing

We welcome contributions!

Development Setup

  1. Fork the repository
  2. Create a feature branch
  3. Install development dependencies:
    pip install -r requirements-dev.txt
    
  4. Make your changes
  5. Run tests:
    pytest
    
  6. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgements


Made with โค๏ธ by Akash Anandani