title: AutoML
emoji: ๐ฆ
colorFrom: blue
colorTo: pink
sdk: streamlit
sdk_version: 1.44.0
app_file: app.py
pinned: true
license: mit
short_description: Automated Machine Learning platform
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/66c623e4c36beb1532189397/Hp59Si4oWEY4X4D95ZPRU.png
AutoML automates the end-to-end process of applying machine learning to real-world problems. It handles model selection, hyperparameter tuning, and model export, making machine learning more accessible.
Live Demo
Check out the live demo of AutoML and experience the power of automated machine learning firsthand!
Video Showcase
See AutoML in action: This demonstration shows how to analyze data, train models, and get AI-powered insights in minutes!
Features
Data Visualization and Analysis: Interactive visualizations to understand your data
- Correlation heatmaps
- Distribution plots
- Feature importance charts
- Pair plots for relationship analysis
Automated Data Cleaning and Preprocessing: Handle missing values, outliers, and feature engineering
- Automatic detection and handling of missing values
- Outlier detection and treatment
- Feature scaling and normalization
- Categorical encoding (One-Hot, Label, Target encoding)
Multiple ML Model Selection: Choose from a variety of models or let AutoML select the best one
- Classification models: Logistic Regression, Random Forest, XGBoost, SVC, Decision Tree, KNN, Gradient Boosting, AdaBoost, Gaussian Naive Bayes, QDA, LDA
- Regression models: Linear Regression, Random Forest, XGBoost, SVR, Decision Tree, KNN, ElasticNet, Gradient Boosting, AdaBoost, Bayesian Ridge, Ridge, Lasso
Hyperparameter Tuning: Optimize model performance with advanced tuning techniques
- Supports hyperparameter tuning for 20+ models
- Supports 10+ hyperparameter tuning techniques
Model Performance Evaluation: Comprehensive metrics and visualizations
- Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC, Confusion Matrix
- Regression: MAE, MSE, RMSE, R², Residual Plots
AI-powered Data Insights: Leverage Google's Gemini for intelligent data analysis
- Natural language explanations of model decisions
- Automated feature importance interpretation
- Data quality assessment
- Trend identification and anomaly detection
LLM Fine-Tuning and Download: Access and use pre-trained language models
- Download fine-tuned LLMs for specific domains
- Customize existing models for your specific use case
- Access to various model sizes (small, medium, large)
- Seamless integration with your data processing pipeline
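The classification metrics named under Model Performance Evaluation above can be computed with scikit-learn, which the project credits in its acknowledgements. The following is a minimal sketch on invented toy labels; it does not claim to show how the app itself computes or displays these values.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Toy labels, invented for illustration only.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(cm)
```

Here one positive sample is missed, so precision stays at 1.0 while recall drops to 2/3, which is why both metrics are worth reporting side by side.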
Installation
Prerequisites
- Python 3.8 or higher
- Google API key for Gemini (data insights and DataFrame cleaning)
- Groq API key for LLM-based analysis of test results
- LangSmith API key for monitoring LLM calls
Setup
- Clone the repository:
git clone <repository-url>
cd Auto-ML
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Set up your environment variables:
# Create a .env file with your Google API key as well as other keys
echo "GOOGLE_API_KEY=your_api_key_here" > .env
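Because the app depends on several API keys, it can help to check them before launching. The sketch below is an assumption-laden example: only `GOOGLE_API_KEY` appears in this README, so the names `GROQ_API_KEY` and `LANGSMITH_API_KEY` and the helper `missing_keys` are hypothetical. It inspects `os.environ` only; loading the `.env` file into the process (for example with python-dotenv) is assumed to happen elsewhere.

```python
import os

# Only GOOGLE_API_KEY is named in the README; the other two key names
# are assumptions made for this illustration.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "GROQ_API_KEY", "LANGSMITH_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `env`."""
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All API keys are set.")
```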
Usage
Start the application:
streamlit run app.py
Quick Start Guide
Upload Data: Upload your CSV file
- Supported format: CSV
- Automatic data type detection
- Preview of first few rows
Explore Data: Visualize and understand your dataset
- Summary statistics
- Correlation analysis
- Distribution visualization
- Missing value analysis
Preprocess: Clean and transform your data
- Handle missing values (imputation strategies)
- Remove or transform outliers
- Feature scaling options
- Encoding categorical variables
Train Models: Select models and tune hyperparameters
- Choose target variable and features
- Select machine learning algorithms
- Configure hyperparameter search space
- Set evaluation metrics
Evaluate: Compare model performance
- Performance metrics visualization
- Feature importance analysis
- Model comparison dashboard
- Cross-validation results
Deploy: Export your model
- Download trained model as pickle file
Project Structure
Auto-ML/
├── app.py                # Main Streamlit application
├── requirements.txt      # Project dependencies
├── .env                  # Environment variables (API keys)
├── README.md             # Project documentation
├── models/               # Saved model files
├── logs/                 # Application logs
└── src/                  # Source code
    ├── __init__.py       # Package initialization
    ├── preprocessing/    # Data preprocessing modules
    │   ├── __init__.py
    │   └── ...           # Data cleaning, transformation
    ├── training/         # Model training modules
    │   ├── __init__.py
    │   └── ...           # Model training, evaluation
    ├── ui/               # User interface components
    │   ├── __init__.py
    │   └── ...           # Streamlit UI elements
    └── utils/            # Utility functions
        ├── __init__.py
        └── ...           # Helper functions
Preprocessing Pipelines
1. Data Ingestion Pipeline
Purpose: Collects raw data from multiple sources (CSV, databases, APIs).
- Reads structured/unstructured data
- Handles missing values and duplicates
- Converts raw data into a clean DataFrame
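A minimal sketch of the ingestion step, assuming pandas (credited in the acknowledgements) and a CSV source. The function name `ingest_csv`, the inline data, and the column names are invented for illustration and are not taken from the project's code.

```python
import io

import pandas as pd

# Inline CSV standing in for an uploaded file; the columns are invented.
RAW_CSV = """age,city,income
34,Paris,52000
34,Paris,52000
,Berlin,61000
29,,48000
"""


def ingest_csv(source):
    """Read a CSV source into a DataFrame and drop exact duplicate rows."""
    df = pd.read_csv(source)
    return df.drop_duplicates().reset_index(drop=True)


df = ingest_csv(io.StringIO(RAW_CSV))
print(df)
print(df.isna().sum())  # Missing values per column
```

The missing-value counts are only reported here; filling or dropping them is left to the cleaning stage.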
2. Data Cleaning & Preprocessing Pipeline
Purpose: Transforms raw data into a machine-learning-ready format.
- Cleans Data: Handles NaNs, outliers, and standardizes columns
- Encodes Categorical Features: One-hot encoding, label encoding
- Scales Numerical Data: MinMaxScaler, StandardScaler
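The cleaning, encoding, and scaling steps listed above can be combined in one scikit-learn `ColumnTransformer`. This is a sketch under assumptions: the toy DataFrame, the median and most-frequent imputation strategies, and the choice of `StandardScaler` over `MinMaxScaler` are illustrative and may differ from what the project does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with gaps in both numeric and categorical columns.
df = pd.DataFrame(
    {
        "age": [34.0, np.nan, 29.0, 41.0],
        "income": [52000.0, 61000.0, 48000.0, np.nan],
        "city": ["Paris", "Berlin", np.nan, "Paris"],
    }
)

# Numeric columns: fill NaNs with the median, then standardize.
numeric = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)
# Categorical columns: fill NaNs with the mode, then one-hot encode.
categorical = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocess = ColumnTransformer(
    [("num", numeric, ["age", "income"]), ("cat", categorical, ["city"])],
    sparse_threshold=0,  # Always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)
```

Fitting the transformer on training data only and reusing it on new data keeps the imputation values and scaling statistics consistent between training and prediction.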
3. Model Selection & Training Pipeline
Purpose: Automates the process of selecting and training models.
- Multiple Algorithms: Trains XGBoost, RandomForest, Deep Learning models
- Hyperparameter Optimization: Finds the best config for each model
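One way to picture this pipeline is a loop that tunes each candidate model and keeps the one with the best cross-validated score. The sketch below uses only scikit-learn models and `GridSearchCV` on synthetic data; the candidate list, the parameter grids, and the accuracy metric are assumptions, and XGBoost or deep learning models are left out to keep the example dependency-light.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for an uploaded dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate models with small, illustrative search grids.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
}

# Tune every candidate with 3-fold cross-validation.
results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)
    results[name] = search

# Keep the candidate with the highest mean cross-validated accuracy.
best_name = max(results, key=lambda n: results[n].best_score_)
best_model = results[best_name].best_estimator_

print(best_name, results[best_name].best_params_)
print("held-out accuracy:", best_model.score(X_test, y_test))
```

The held-out split is scored only once, after selection, so it is not used to choose between candidates.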
4. Model Deployment Pipeline
Purpose: Makes the model available for real-world usage.
- Exports the model (Pickle, ONNX, TensorFlow SavedModel)
- Makes the model available for download right after training
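For the pickle export, a round trip through bytes shows what a downloaded file would contain. This is a sketch with a stand-in scikit-learn model on synthetic data; how the app actually serializes and serves the file is not described in this README.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for a model trained in the app.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize to bytes; the same bytes could be written to a .pkl file.
payload = pickle.dumps(model)

# Later, or on another machine with a compatible scikit-learn version:
restored = pickle.loads(payload)
print((restored.predict(X) == model.predict(X)).all())
```

In a Streamlit app, bytes like `payload` could be passed to `st.download_button`, though whether `app.py` does this is an assumption. Pickle files can execute code when loaded, so only unpickle models from a trusted source.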
Feedback and Fallback Mechanism
AutoML implements a robust feedback and fallback system to ensure reliability:
Data Cleaning Validation: The system validates all cleaning operations and provides feedback on the changes made
- Automatic detection of cleaning effectiveness
- Detailed logs of transformations applied to the data
LLM Fallback Mechanism: For AI-powered insights and data analysis
- Primary attempt uses advanced LLMs (Google Gemini/Groq)
- Automatic fallback to rule-based algorithms if LLM fails
- Graceful degradation to ensure core functionality remains available
- Error logging and reporting for continuous improvement
- LangSmith integration for monitoring and tracking all LLM calls
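The fallback behaviour described above follows a familiar try-then-degrade pattern. The sketch below is illustrative only: `get_insights`, `rule_based_summary`, the 20% threshold, and the simulated failure are hypothetical and stand in for the real Gemini or Groq calls.

```python
import logging

logger = logging.getLogger("automl.insights")


def rule_based_summary(stats):
    """Fallback: flag columns whose missing-value ratio exceeds 20%."""
    flagged = sorted(col for col, ratio in stats.items() if ratio > 0.2)
    if not flagged:
        return "No columns exceed the missing-value threshold."
    return "High missing-value ratio in: " + ", ".join(flagged)


def get_insights(stats, llm_call):
    """Try the LLM first; fall back to the rule-based summary on failure."""
    try:
        return {"source": "llm", "text": llm_call(stats)}
    except Exception as exc:
        # Log the failure so it can be reviewed later.
        logger.warning("LLM call failed, using rule-based fallback: %s", exc)
        return {"source": "rules", "text": rule_based_summary(stats)}


def failing_llm(stats):
    """Simulates an LLM outage for the demonstration."""
    raise TimeoutError("simulated LLM outage")


result = get_insights({"age": 0.05, "income": 0.35}, failing_llm)
print(result)
```

Returning the `source` alongside the text lets the UI tell users whether an insight came from the LLM or from the rule-based path.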
Error Feedback Loop: Intelligent error handling during data cleaning
- Automatically captures errors that occur during data cleaning operations
- Sends error context to LLM to generate refined cleaning code
- Re-executes the improved cleaning process
- Iterative refinement ensures robust data preparation even with challenging datasets
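The error feedback loop can be sketched as a bounded retry in which each failure is handed to a refinement step. To keep the example safe and self-contained, cleaning steps are plain Python callables rather than generated code, and `fake_refine` stands in for the LLM call; all names here are hypothetical.

```python
def run_with_feedback(clean, data, refine, max_attempts=3):
    """Run `clean`; on failure, ask `refine` for a new version and retry."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return clean(data), attempt
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # Pass the failing step and its error to the refiner.
                clean = refine(clean, exc)
    raise RuntimeError(
        f"cleaning failed after {max_attempts} attempts"
    ) from last_error


def naive_clean(rows):
    """First attempt: fails on values that are not numeric."""
    return [float(value) for value in rows]


def tolerant_clean(rows):
    """Refined attempt: maps unparseable values to None."""
    cleaned = []
    for value in rows:
        try:
            cleaned.append(float(value))
        except ValueError:
            cleaned.append(None)
    return cleaned


def fake_refine(clean, error):
    """Stand-in for an LLM that returns improved cleaning logic."""
    return tolerant_clean


cleaned, attempts = run_with_feedback(naive_clean, ["1.5", "n/a", "3"], fake_refine)
print(cleaned, attempts)
```

Capping the number of attempts keeps a persistently failing dataset from triggering an unbounded series of LLM calls.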
Contributing
We welcome contributions!
Development Setup
- Fork the repository
- Create a feature branch
- Install development dependencies:
pip install -r requirements-dev.txt
- Make your changes
- Run tests:
pytest
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
- Streamlit for the interactive web framework
- Scikit-learn for machine learning algorithms
- Pandas for data manipulation
- Plotly for interactive visualizations
- Google Gemini for AI-powered insights
- XGBoost for gradient boosting
- Seaborn for statistical visualizations
- LangChain for large language model integration
- LangSmith for LLM call tracking and monitoring
- Groq for fast LLM inference
Made with ❤️ by Akash Anandani