---
title: AutoML
emoji: 🦀
colorFrom: blue
colorTo: pink
sdk: streamlit
sdk_version: 1.44.0
app_file: app.py
pinned: true
license: mit
short_description: Automated Machine Learning platform
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/66c623e4c36beb1532189397/Hp59Si4oWEY4X4D95ZPRU.png
---
<!-- Custom header with green glow effect -->
<p align="center">
<img src="header.svg" alt="AutoML - Automated Machine Learning Platform" width="800" />
</p>
<p align="center">
<a href="https://github.com/username/Auto-ML/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License: MIT"></a>
<a href="https://www.python.org/"><img src="https://img.shields.io/badge/Made%20with-Python-1f425f.svg" alt="Made with Python"></a>
<a href="https://streamlit.io/"><img src="https://img.shields.io/badge/Made%20with-Streamlit-FF4B4B.svg" alt="Made with Streamlit"></a>
<a href="https://scikit-learn.org/"><img src="https://img.shields.io/badge/Made%20with-Scikit--Learn-F7931E.svg" alt="Made with Scikit-Learn"></a>
</p>
<p align="center">
<a href="https://pandas.pydata.org/"><img src="https://img.shields.io/badge/Made%20with-Pandas-150458.svg" alt="Made with Pandas"></a>
<a href="https://numpy.org/"><img src="https://img.shields.io/badge/Made%20with-NumPy-013243.svg" alt="Made with NumPy"></a>
<a href="https://matplotlib.org/"><img src="https://img.shields.io/badge/Made%20with-Matplotlib-11557c.svg" alt="Made with Matplotlib"></a>
<a href="https://seaborn.pydata.org/"><img src="https://img.shields.io/badge/Made%20with-Seaborn-3776AB.svg" alt="Made with Seaborn"></a>
<a href="https://plotly.com/"><img src="https://img.shields.io/badge/Made%20with-Plotly-3F4F75.svg" alt="Made with Plotly"></a>
<a href="https://xgboost.readthedocs.io/"><img src="https://img.shields.io/badge/Made%20with-XGBoost-0073B7.svg" alt="Made with XGBoost"></a>
</p>
<p align="center">
<a href="https://python.langchain.com/"><img src="https://img.shields.io/badge/Made%20with-LangChain-00A86B.svg" alt="Made with LangChain"></a>
<a href="https://smith.langchain.com/"><img src="https://img.shields.io/badge/Monitored%20with-LangSmith-7742DD.svg" alt="Monitored with LangSmith"></a>
<a href="https://ai.google.dev/"><img src="https://img.shields.io/badge/Powered%20by-Google%20Gemini-4285F4.svg" alt="Powered by Google Gemini"></a>
<a href="https://groq.com/"><img src="https://img.shields.io/badge/Powered%20by-Groq-6236FF.svg" alt="Powered by Groq"></a>
<a href="https://www.python-dotenv.org/"><img src="https://img.shields.io/badge/Made%20with-python--dotenv-2E7D32.svg" alt="Made with python-dotenv"></a>
<a href="https://docs.python.org/3/library/pickle.html"><img src="https://img.shields.io/badge/Uses-pickle-8BC34A.svg" alt="Uses pickle"></a>
</p>
<p align="center">
<b>AutoML</b> is a powerful tool for automating the end-to-end process of applying machine learning to real-world problems. It streamlines model selection, hyperparameter tuning, and model export, making machine learning accessible to everyone.
</p>
## 🔗 Live Demo
<p align="center">
<a href="https://huggingface.co/spaces/kashh65/AutoML" target="_blank">
<img src="https://img.shields.io/badge/Try%20the%20Demo-00B8D9?style=for-the-badge&logo=streamlit&logoColor=white" alt="Try the Demo" />
</a>
</p>
<p align="center">
Check out the live demo of AutoML and experience the power of automated machine learning firsthand!
</p>
## 🎬 Video Showcase
<p align="center">
<img src="automl-gif.gif" alt="AutoML Demonstration" width="800">
</p>
<p align="center">
<em>See AutoML in action: This demonstration shows how to analyze data, train models, and get AI-powered insights in minutes!</em>
</p>
## ✨ Features
- 📊 **Data Visualization and Analysis**: Interactive visualizations to understand your data
- Correlation heatmaps
- Distribution plots
- Feature importance charts
- Pair plots for relationship analysis
- 🧹 **Automated Data Cleaning and Preprocessing**: Handle missing values, outliers, and feature engineering
- Automatic detection and handling of missing values
- Outlier detection and treatment
- Feature scaling and normalization
- Categorical encoding (One-Hot, Label, Target encoding)
- 🤖 **Multiple ML Model Selection**: Choose from a variety of models or let AutoML select the best one
- Classification models: Logistic Regression, Random Forest, XGBoost, SVC, Decision Tree, KNN, Gradient Boosting, AdaBoost, Gaussian Naive Bayes, QDA, LDA
- Regression models: Linear Regression, Random Forest, XGBoost, SVR, Decision Tree, KNN, ElasticNet, Gradient Boosting, AdaBoost, Bayesian Ridge, Ridge, Lasso
- ⚙️ **Hyperparameter Tuning**: Optimize model performance with advanced tuning techniques
  - Support for 20+ models with straightforward hyperparameter fine-tuning
  - Support for 10+ hyperparameter tuning techniques
- 📈 **Model Performance Evaluation**: Comprehensive metrics and visualizations
- Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC, Confusion Matrix
- Regression: MAE, MSE, RMSE, R², Residual Plots
- 🔍 **AI-powered Data Insights**: Leverage Google's Gemini for intelligent data analysis
- Natural language explanations of model decisions
- Automated feature importance interpretation
- Data quality assessment
- Trend identification and anomaly detection
- 🧠 **LLM Fine-Tuning and Download**: Access and utilize pre-trained language models
- Download fine-tuned LLMs for specific domains
- Customize existing models for your specific use case
- Access to various model sizes (small, medium, large)
- Seamless integration with your data processing pipeline
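The evaluation metrics listed above map directly onto scikit-learn's `metrics` module. A minimal sketch on synthetic data (not the app's actual evaluation code):

```python
# Sketch: computing the classification metrics listed above with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data stands in for an uploaded CSV.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probabilities for ROC-AUC

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),
}
print(metrics)
print(confusion_matrix(y_test, y_pred))
```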
## 🚀 Installation
### Prerequisites
- Python 3.8 or higher
- Google API key (Gemini) for data insights and DataFrame cleaning
- Groq API key for LLM-based test-result analysis
- LangSmith API key for monitoring LLM calls
### Setup
1. Clone the repository:
```bash
git clone <repository-url>
cd Auto-ML
```
2. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Set up your environment variables:
```bash
# Create a .env file with your Google API key and the other keys listed above
# (the Groq/LangSmith variable names below may differ; check the app's code)
echo "GOOGLE_API_KEY=your_google_api_key" > .env
echo "GROQ_API_KEY=your_groq_api_key" >> .env
echo "LANGCHAIN_API_KEY=your_langsmith_api_key" >> .env
```
## 🎮 Usage
Start the application:
```bash
streamlit run app.py
```
### Quick Start Guide
1. **Upload Data**: Upload your CSV file
- Supported format: CSV
- Automatic data type detection
- Preview of first few rows
2. **Explore Data**: Visualize and understand your dataset
- Summary statistics
- Correlation analysis
- Distribution visualization
- Missing value analysis
3. **Preprocess**: Clean and transform your data
- Handle missing values (imputation strategies)
- Remove or transform outliers
- Feature scaling options
- Encoding categorical variables
4. **Train Models**: Select models and tune hyperparameters
- Choose target variable and features
- Select machine learning algorithms
- Configure hyperparameter search space
- Set evaluation metrics
5. **Evaluate**: Compare model performance
- Performance metrics visualization
- Feature importance analysis
- Model comparison dashboard
- Cross-validation results
6. **Deploy**: Export your model
- Download trained model as pickle file
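A downloaded pickle file can then be reused outside AutoML with the standard library's `pickle` module. The sketch below trains a stand-in model first so it is self-contained; the filename `model.pkl` is an assumption, so use whatever name your download has:

```python
# Sketch: reusing a model exported by AutoML as a pickle file.
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in for the model AutoML trained and exported for you:
X, y = load_iris(return_X_y=True)
trained = LogisticRegression(max_iter=1000).fit(X, y)
with open("model.pkl", "wb") as f:
    pickle.dump(trained, f)

# Later, in your own code: load the pickle and predict on new rows.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

preds = model.predict(X[:5])
print(preds)
```

Only unpickle files you trust: `pickle.load` can execute arbitrary code.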
## 🧩 Project Structure
```
Auto-ML/
├── app.py # Main Streamlit application
├── requirements.txt # Project dependencies
├── .env # Environment variables (API keys)
├── README.md # Project documentation
├── models/ # Saved model files
├── logs/ # Application logs
└── src/ # Source code
├── __init__.py # Package initialization
├── preprocessing/ # Data preprocessing modules
│ ├── __init__.py
│ └── ... # Data cleaning, transformation
├── training/ # Model training modules
│ ├── __init__.py
│ └── ... # Model training, evaluation
├── ui/ # User interface components
│ ├── __init__.py
│ └── ... # Streamlit UI elements
└── utils/ # Utility functions
├── __init__.py
└── ... # Helper functions
```
## 🛠️ Preprocessing Pipelines
### 1. Data Ingestion Pipeline
**Purpose:** Collects raw data from multiple sources (CSV, databases, APIs).
* Reads structured/unstructured data
* Handles missing values and duplicates
* Converts raw data into a clean DataFrame
### 2. Data Cleaning & Preprocessing Pipeline
**Purpose:** Transforms raw data into a machine-learning-ready format.
* **Cleans Data:** Handles NaNs, outliers, and standardizes columns
* **Encodes Categorical Features:** One-hot encoding, label encoding
* **Scales Numerical Data:** MinMaxScaler, StandardScaler
### 3. Model Selection & Training Pipeline
**Purpose:** Automates the process of selecting and training models.
* **Multiple Algorithms:** Trains XGBoost, RandomForest, Deep Learning models
* **Hyperparameter Optimization:** Finds the best config for each model
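In spirit, the selection step is a cross-validated comparison across candidate models; a minimal sketch (the real pipeline covers many more models and tuning techniques):

```python
# Sketch: pick the best of several candidate models by mean CV accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(random_state=0),
}

# Mean 5-fold cross-validation score per candidate.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```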
### 4. Model Deployment Pipeline
**Purpose:** Makes the model available for real-world usage.
* Exports the Model (Pickle, ONNX, TensorFlow SavedModel)
* Easily downloadable after training
## 🔄 Feedback and Fallback Mechanism
AutoML implements a robust feedback and fallback system to ensure reliability:
1. **Data Cleaning Validation**: The system validates all cleaning operations and provides feedback on the changes made
- Automatic detection of cleaning effectiveness
- Detailed logs of transformations applied to the data
2. **LLM Fallback Mechanism**: For AI-powered insights and data analysis
- Primary attempt uses advanced LLMs (Google Gemini/Groq)
- Automatic fallback to rule-based algorithms if LLM fails
- Graceful degradation to ensure core functionality remains available
- Error logging and reporting for continuous improvement
- LangSmith integration for monitoring and tracking all LLM calls
3. **Error Feedback Loop**: Intelligent error handling during data cleaning
- Automatically captures errors that occur during data cleaning operations
- Sends error context to LLM to generate refined cleaning code
- Re-executes the improved cleaning process
- Iterative refinement ensures robust data preparation even with challenging datasets
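The fallback pattern described above can be sketched as follows; `call_llm` is a hypothetical stand-in for the real Gemini/Groq client, and it is made to fail here so the rule-based path runs:

```python
# Sketch: LLM-first insights with a rule-based fallback (graceful degradation).
import pandas as pd

def call_llm(prompt: str) -> str:
    # Hypothetical LLM call; always fails here to demonstrate the fallback.
    raise ConnectionError("LLM unavailable")

def rule_based_insights(df: pd.DataFrame) -> str:
    # Deterministic fallback: basic data-quality facts, no LLM required.
    missing = int(df.isna().sum().sum())
    return f"{len(df)} rows, {df.shape[1]} columns, {missing} missing values"

def data_insights(df: pd.DataFrame) -> str:
    try:
        return call_llm(f"Summarize this dataset: {df.describe()}")
    except Exception:
        # Core functionality stays available even when the LLM call fails.
        return rule_based_insights(df)

df = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", "z"]})
print(data_insights(df))  # falls back: "3 rows, 2 columns, 1 missing values"
```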
## 🤝 Contributing
We welcome contributions!
### Development Setup
1. Fork the repository
2. Create a feature branch
3. Install development dependencies:
```bash
pip install -r requirements-dev.txt
```
4. Make your changes
5. Run tests:
```bash
pytest
```
6. Submit a pull request
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgements
- [Streamlit](https://streamlit.io/) for the interactive web framework
- [Scikit-learn](https://scikit-learn.org/) for machine learning algorithms
- [Pandas](https://pandas.pydata.org/) for data manipulation
- [Plotly](https://plotly.com/) for interactive visualizations
- [Google Gemini](https://ai.google.dev/) for AI-powered insights
- [XGBoost](https://xgboost.readthedocs.io/) for gradient boosting
- [Seaborn](https://seaborn.pydata.org/) for statistical visualizations
- [LangChain](https://python.langchain.com/) for large language model integration
- [LangSmith](https://smith.langchain.com/) for LLM call tracking and monitoring
- [Groq](https://groq.com/) for high-performance LLM inference
---
<p align="center">
Made with ❤️ by Akash Anandani
</p>