---
title: AutoML
emoji: 🦀
colorFrom: blue
colorTo: pink
sdk: streamlit
sdk_version: 1.44.0
app_file: app.py
pinned: true
license: mit
short_description: Automated Machine Learning platform
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/66c623e4c36beb1532189397/Hp59Si4oWEY4X4D95ZPRU.png
---

<!-- Custom header with green glow effect -->
<p align="center">
  <img src="header.svg" alt="AutoML - Automated Machine Learning Platform" width="800" />
</p>

<p align="center">
  <a href="https://github.com/username/Auto-ML/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License: MIT"></a>
  <a href="https://www.python.org/"><img src="https://img.shields.io/badge/Made%20with-Python-1f425f.svg" alt="Made with Python"></a>
  <a href="https://streamlit.io/"><img src="https://img.shields.io/badge/Made%20with-Streamlit-FF4B4B.svg" alt="Made with Streamlit"></a>
  <a href="https://scikit-learn.org/"><img src="https://img.shields.io/badge/Made%20with-Scikit--Learn-F7931E.svg" alt="Made with Scikit-Learn"></a>
</p>

<p align="center">
  <a href="https://pandas.pydata.org/"><img src="https://img.shields.io/badge/Made%20with-Pandas-150458.svg" alt="Made with Pandas"></a>
  <a href="https://numpy.org/"><img src="https://img.shields.io/badge/Made%20with-NumPy-013243.svg" alt="Made with NumPy"></a>
  <a href="https://matplotlib.org/"><img src="https://img.shields.io/badge/Made%20with-Matplotlib-11557c.svg" alt="Made with Matplotlib"></a>
  <a href="https://seaborn.pydata.org/"><img src="https://img.shields.io/badge/Made%20with-Seaborn-3776AB.svg" alt="Made with Seaborn"></a>
  <a href="https://plotly.com/"><img src="https://img.shields.io/badge/Made%20with-Plotly-3F4F75.svg" alt="Made with Plotly"></a>
  <a href="https://xgboost.readthedocs.io/"><img src="https://img.shields.io/badge/Made%20with-XGBoost-0073B7.svg" alt="Made with XGBoost"></a>
</p>

<p align="center">
  <a href="https://python.langchain.com/"><img src="https://img.shields.io/badge/Made%20with-LangChain-00A86B.svg" alt="Made with LangChain"></a>
  <a href="https://smith.langchain.com/"><img src="https://img.shields.io/badge/Monitored%20with-LangSmith-7742DD.svg" alt="Monitored with LangSmith"></a>
  <a href="https://ai.google.dev/"><img src="https://img.shields.io/badge/Powered%20by-Google%20Gemini-4285F4.svg" alt="Powered by Google Gemini"></a>
  <a href="https://groq.com/"><img src="https://img.shields.io/badge/Powered%20by-Groq-6236FF.svg" alt="Powered by Groq"></a>
  <a href="https://www.python-dotenv.org/"><img src="https://img.shields.io/badge/Made%20with-python--dotenv-2E7D32.svg" alt="Made with python-dotenv"></a>
  <a href="https://pickle.readthedocs.io/"><img src="https://img.shields.io/badge/Uses-pickle-8BC34A.svg" alt="Uses pickle"></a>
</p>

<p align="center">
  <b>AutoML</b> is a powerful tool for automating the end-to-end process of applying machine learning to real-world problems. It simplifies model selection, hyperparameter tuning, and model export, making machine learning accessible to everyone.
</p>

## 🔗 Live Demo

<p align="center">
  <a href="https://huggingface.co/spaces/kashh65/AutoML" target="_blank">
    <img src="https://img.shields.io/badge/Try%20the%20Demo-00B8D9?style=for-the-badge&logo=streamlit&logoColor=white" alt="Try the Demo" />
  </a>
</p>

<p align="center">
  Check out the live demo of AutoML and experience the power of automated machine learning firsthand!
</p>

## 🎬 Video Showcase

<p align="center">
  <img src="automl-gif.gif" alt="AutoML Demonstration" width="800">
</p>

<p align="center">
  <em>See AutoML in action: This demonstration shows how to analyze data, train models, and get AI-powered insights in minutes!</em>
</p>

## ✨ Features

- 📊 **Data Visualization and Analysis**: Interactive visualizations to understand your data
  - Correlation heatmaps
  - Distribution plots
  - Feature importance charts
  - Pair plots for relationship analysis
  
- 🧹 **Automated Data Cleaning and Preprocessing**: Handle missing values, outliers, and feature engineering
  - Automatic detection and handling of missing values
  - Outlier detection and treatment
  - Feature scaling and normalization
  - Categorical encoding (One-Hot, Label, Target encoding)
  
- 🤖 **Multiple ML Model Selection**: Choose from a variety of models or let AutoML select the best one
  - Classification models: Logistic Regression, Random Forest, XGBoost, SVC, Decision Tree, KNN, Gradient Boosting, AdaBoost, Gaussian Naive Bayes, QDA, LDA
  - Regression models: Linear Regression, Random Forest, XGBoost, SVR, Decision Tree, KNN, ElasticNet, Gradient Boosting, AdaBoost, Bayesian Ridge, Ridge, Lasso
  
- ⚙️ **Hyperparameter Tuning**: Optimize model performance with advanced tuning techniques
  - Support for fine-tuning hyperparameters across 20+ models
  - Support for 10+ hyperparameter tuning techniques
  
  
- 📈 **Model Performance Evaluation**: Comprehensive metrics and visualizations (see the metrics sketch after this list)
  - Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC, Confusion Matrix
  - Regression: MAE, MSE, RMSE, R², Residual Plots
  
- 🔍 **AI-powered Data Insights**: Leverage Google's Gemini for intelligent data analysis
  - Natural language explanations of model decisions
  - Automated feature importance interpretation
  - Data quality assessment
  - Trend identification and anomaly detection

- 🧠 **LLM Fine-Tuning and Download**: Access and utilize pre-trained language models
  - Download fine-tuned LLMs for specific domains
  - Customize existing models for your specific use case
  - Access to various model sizes (small, medium, large)
  - Seamless integration with your data processing pipeline
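
As a taste of what the evaluation step reports, here is a minimal sketch of computing the classification metrics listed above with scikit-learn; `y_test`, `y_pred`, and `y_proba` are placeholders for your own split and model output, not names from this codebase.

```python
# Illustrative only: the classification metrics listed in the Features section,
# computed with scikit-learn for the binary case.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

def classification_metrics(y_test, y_pred, y_proba):
    """Placeholder helper: y_proba is the predicted probability of the positive class."""
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_proba),
        "confusion_matrix": confusion_matrix(y_test, y_pred).tolist(),
    }
```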

## 🚀 Installation

### Prerequisites

- Python 3.8 or higher
- Google API key (Gemini) for data insights and DataFrame cleaning
- Groq API key for LLM-based test-result analysis
- LangSmith API key for monitoring LLM calls

### Setup

1. Clone the repository:
```bash
git clone <repository-url>
cd Auto-ML
```

2. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```

4. Set up your environment variables:
```bash
# Create a .env file with your Google API key
echo "GOOGLE_API_KEY=your_api_key_here" > .env
# The Groq and LangSmith keys are also needed for those features; the variable
# names below are common conventions, so check the source for the exact names expected:
echo "GROQ_API_KEY=your_groq_api_key_here" >> .env
echo "LANGSMITH_API_KEY=your_langsmith_api_key_here" >> .env
```

## 🎮 Usage

Start the application:

```bash
streamlit run app.py
```

### Quick Start Guide

1. **Upload Data**: Upload your CSV file
   - Supported format: CSV
   - Automatic data type detection
   - Preview of first few rows

2. **Explore Data**: Visualize and understand your dataset
   - Summary statistics
   - Correlation analysis
   - Distribution visualization
   - Missing value analysis

3. **Preprocess**: Clean and transform your data
   - Handle missing values (imputation strategies)
   - Remove or transform outliers
   - Feature scaling options
   - Encoding categorical variables

4. **Train Models**: Select models and tune hyperparameters
   - Choose target variable and features
   - Select machine learning algorithms
   - Configure hyperparameter search space
   - Set evaluation metrics

5. **Evaluate**: Compare model performance
   - Performance metrics visualization
   - Feature importance analysis
   - Model comparison dashboard
   - Cross-validation results

6. **Deploy**: Export your model
   - Download the trained model as a pickle file (a sketch of loading it for inference follows below)
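
A minimal sketch of using a downloaded model for predictions, assuming it was exported with pickle and trained on the same feature columns; the file name and column names below are placeholders.

```python
# Illustrative only: load a model downloaded from the app and score new data.
# "model.pkl" and the feature columns are placeholders for your own export.
import pickle

import pandas as pd

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

new_data = pd.DataFrame({
    "feature_1": [0.3, 1.2],   # must match the features used during training
    "feature_2": [5, 7],
})
print(model.predict(new_data))
```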



   
## 🧩 Project Structure

```
Auto-ML/
├── app.py                  # Main Streamlit application
├── requirements.txt        # Project dependencies
├── .env                    # Environment variables (API keys)
├── README.md               # Project documentation
├── models/                 # Saved model files
├── logs/                   # Application logs
└── src/                    # Source code
    ├── __init__.py         # Package initialization
    ├── preprocessing/      # Data preprocessing modules
    │   ├── __init__.py
    │   └── ...             # Data cleaning, transformation
    ├── training/           # Model training modules
    │   ├── __init__.py
    │   └── ...             # Model training, evaluation
    ├── ui/                 # User interface components
    │   ├── __init__.py
    │   └── ...             # Streamlit UI elements
    └── utils/              # Utility functions
        ├── __init__.py
        └── ...             # Helper functions
```



## Preprocessing Pipelines

### 1. Data Ingestion Pipeline

**Purpose:** Collects raw data from multiple sources (CSV, databases, APIs).

*   Reads structured/unstructured data
*   Handles missing values and duplicates
*   Converts raw data into a clean DataFrame
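
For illustration, a minimal ingestion step along the lines described above, assuming a CSV source; the function name is a placeholder, not the module's actual API.

```python
# Illustrative ingestion: read a CSV, drop duplicates, and report missing values.
import pandas as pd

def ingest_csv(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df = df.drop_duplicates()
    missing = df.isna().sum()
    if missing.any():
        print("Missing values per column:\n", missing[missing > 0])
    return df
```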

### 2. Data Cleaning & Preprocessing Pipeline

**Purpose:** Transforms raw data into a machine-learning-ready format.

*   **Cleans Data:** Handles NaNs, outliers, and standardizes columns
*   **Encodes Categorical Features:** One-hot encoding, label encoding
*   **Scales Numerical Data:** MinMaxScaler, StandardScaler
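
The encoding and scaling steps above map naturally onto scikit-learn's `ColumnTransformer`; a sketch, assuming numeric and categorical columns are selected by dtype (the app's actual column handling may differ).

```python
# Illustrative preprocessing: impute and scale numeric columns, one-hot encode categoricals.
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric, make_column_selector(dtype_include="number")),
    ("cat", categorical, make_column_selector(dtype_exclude="number")),
])
```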




### 3. Model Selection & Training Pipeline

**Purpose:** Automates model selection and training.

*   **Multiple Algorithms:** Trains XGBoost, RandomForest, Deep Learning models
*   **Hyperparameter Optimization:** Finds the best config for each model
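
A sketch of the selection-plus-tuning idea: grid-search a few candidate estimators and keep the best cross-validated one. The candidate list and parameter grids are examples, not the app's actual search space.

```python
# Illustrative model selection with hyperparameter search.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

CANDIDATES = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestClassifier(), {"n_estimators": [100, 300],
                                                 "max_depth": [None, 10]}),
}

def select_best(X_train, y_train):
    best_name, best_search = None, None
    for name, (model, grid) in CANDIDATES.items():
        search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
        search.fit(X_train, y_train)
        if best_search is None or search.best_score_ > best_search.best_score_:
            best_name, best_search = name, search
    return best_name, best_search.best_estimator_, best_search.best_score_
```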



### 4. Model Deployment Pipeline

**Purpose:** Makes the model available for real-world usage.

*   Exports the model (Pickle, ONNX, TensorFlow SavedModel)
*   Easy download after training
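
The pickle export mentioned above reduces to a single `pickle.dump` call; a minimal sketch, with the output path as a placeholder.

```python
# Illustrative export of a trained model with pickle; "model.pkl" is a placeholder path.
import pickle

def export_model(model, path: str = "model.pkl") -> None:
    with open(path, "wb") as f:
        pickle.dump(model, f)
```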



## Feedback and Fallback Mechanism

AutoML implements a robust feedback and fallback system to ensure reliability:

1. **Data Cleaning Validation**: The system validates all cleaning operations and provides feedback on the changes made
   - Automatic detection of cleaning effectiveness
   - Detailed logs of transformations applied to the data

2. **LLM Fallback Mechanism**: For AI-powered insights and data analysis
   - Primary attempt uses advanced LLMs (Google Gemini/Groq)
   - Automatic fallback to rule-based algorithms if LLM fails
   - Graceful degradation to ensure core functionality remains available
   - Error logging and reporting for continuous improvement
   - LangSmith integration for monitoring and tracking all LLM calls

3. **Error Feedback Loop**: Intelligent error handling during data cleaning
   - Automatically captures errors that occur during data cleaning operations
   - Sends error context to LLM to generate refined cleaning code
   - Re-executes the improved cleaning process
   - Iterative refinement ensures robust data preparation even with challenging datasets
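
A sketch of the fallback-and-retry pattern described above: try the LLM-generated cleaning code, feed any execution error back for a refined attempt, and fall back to a rule-based path if all attempts fail. The callables are hypothetical placeholders, not functions from this repository.

```python
# Illustrative fallback / error-feedback loop. The three callables passed in are
# hypothetical placeholders for the app's internal LLM and rule-based routines.
def clean_with_feedback(df, ask_llm_for_cleaning_code, run_cleaning_code,
                        rule_based_cleaning, max_attempts: int = 3):
    error_context = None
    for attempt in range(max_attempts):
        try:
            code = ask_llm_for_cleaning_code(df, error_context)  # primary: LLM-generated code
            return run_cleaning_code(df, code)
        except Exception as exc:                                 # capture the error context
            error_context = f"Attempt {attempt + 1} failed: {exc}"
    return rule_based_cleaning(df)                               # graceful rule-based fallback
```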

## 🤝 Contributing

We welcome contributions! 

### Development Setup

1. Fork the repository
2. Create a feature branch
3. Install development dependencies:
   ```bash
   pip install -r requirements-dev.txt
   ```
4. Make your changes
5. Run tests:
   ```bash
   pytest
   ```
6. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgements

- [Streamlit](https://streamlit.io/) for the interactive web framework
- [Scikit-learn](https://scikit-learn.org/) for machine learning algorithms
- [Pandas](https://pandas.pydata.org/) for data manipulation
- [Plotly](https://plotly.com/) for interactive visualizations
- [Google Gemini](https://ai.google.dev/) for AI-powered insights
- [XGBoost](https://xgboost.readthedocs.io/) for gradient boosting
- [Seaborn](https://seaborn.pydata.org/) for statistical visualizations
- [LangChain](https://python.langchain.com/) for large language model integration
- [LangSmith](https://smith.langchain.com/) for LLM call tracking and monitoring
- [Groq](https://groq.com/) for high-speed LLM inference

---

<p align="center">
  Made with ❤️ by Akash Anandani
</p>