---
title: Insights
emoji: 📈
colorFrom: gray
colorTo: yellow
sdk: streamlit
sdk_version: 1.33.0
app_file: app.py
pinned: false
---
# Insights: Gen-AI Based Data Analysis Tool

## Deployment
[HuggingFace](https://huggingface.co/spaces/AtharvaThakur/Insights)

## Overview

**Insights** is a Gen-AI based data analysis tool that leverages the Gemini-Pro large language model (LLM) to automate and enhance the data analysis process. It aims to handle end-to-end analysis tasks, from dataset summarization and exploration to query answering and model building, providing substantial cost and time savings while matching or exceeding the performance of a junior data analyst on routine tasks.

## Table of Contents

1. [Introduction](#introduction)
2. [Features](#features)
3. [System Architecture](#system-architecture)
4. [Modules Overview](#modules-overview)
5. [Usage](#usage)
6. [Installation](#installation)
7. [License](#license)

## Introduction

In today's data-driven world, robust data analysis tools are crucial for informed decision-making and strategic planning. Traditional data analysis methods often face challenges such as time-consuming processes, potential for errors, and the need for specialized expertise. **Insights** addresses these issues by utilizing AI to streamline and enhance the data analysis process.

## Features

- **Automated Data Analysis**: Perform data collection, visualization, and analysis with minimal human intervention.
- **Advanced Summarization**: Generate detailed summaries and potential questions for datasets.
- **Exploratory Data Analysis (EDA)**: Tools for statistical summaries, distribution plots, and correlation matrices.
- **Data Cleaning and Transformation**: Functions for handling missing values, outlier detection, normalization, and feature engineering.
- **Machine Learning Toolkit**: Automates model selection, training, hyperparameter tuning, and evaluation.
- **Query Answering Module**: Generate Python code to answer user queries and produce visualizations.

## System Architecture

The **Insights** tool is built on the Gemini platform and consists of three main components:

1. **Summary Module**
2. **QA Module**
3. **Code Execution and Analysis Generation**

### Summary Module

Extracts essential details about the dataset and generates a comprehensive summary along with potential questions for further exploration.
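
The README does not reproduce the module's actual prompt, but a minimal sketch of how dataset details could be passed to Gemini-Pro for summarization might look like the following. The `summarize_dataset` helper, the prompt wording, and the extracted fields are illustrative assumptions, not the project's code; only the `GOOGLE_API_KEY` environment variable and the Gemini-Pro model are taken from this README.

```python
import os
import pandas as pd
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

def summarize_dataset(df: pd.DataFrame) -> str:
    """Illustrative helper: prompt Gemini-Pro with basic dataset facts."""
    info = {
        "rows": len(df),
        "columns": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "sample": df.head(5).to_dict(orient="records"),
    }
    prompt = (
        "You are a data analyst. Given the dataset description below, "
        "write a short summary and list five questions worth exploring.\n"
        f"{info}"
    )
    model = genai.GenerativeModel("gemini-pro")
    return model.generate_content(prompt).text
```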

### QA Module

Handles user queries related to the dataset, generating Python code to answer the queries and produce visualizations.
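
A hedged sketch of the query-to-code step: combine the user's question with column metadata and ask the model to return executable pandas/matplotlib code. The function name and prompt wording are assumptions for illustration, not the module's actual implementation.

```python
import pandas as pd
import google.generativeai as genai

def generate_answer_code(df: pd.DataFrame, query: str) -> str:
    """Illustrative: ask Gemini-Pro for Python code that answers `query` on `df`."""
    schema = ", ".join(f"{col} ({dtype})" for col, dtype in df.dtypes.astype(str).items())
    prompt = (
        "Write Python code that answers the question below using a pandas "
        "DataFrame named `df`. Use matplotlib for any plots and print the "
        "final answer.\n"
        f"Columns: {schema}\n"
        f"Question: {query}"
    )
    model = genai.GenerativeModel("gemini-pro")
    code = model.generate_content(prompt).text
    # Strip a Markdown code fence if the model wraps its answer in one.
    return code.strip().removeprefix("```python").removesuffix("```").strip()
```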

### Code Execution and Analysis Generation

Executes the generated Python code offline to ensure data security, producing detailed responses and visualizations.
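
One plausible shape for the offline execution step, sketched under assumptions: run the generated code with `exec` against a namespace that already holds the DataFrame, and capture anything it prints. The project's actual sandboxing details are not described in this README.

```python
import contextlib
import io
import pandas as pd
import matplotlib.pyplot as plt

def run_generated_code(code: str, df: pd.DataFrame) -> str:
    """Illustrative: execute model-generated code locally and capture its stdout."""
    namespace = {"df": df, "pd": pd, "plt": plt}
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # runs on the local machine; the data never leaves it
    return buffer.getvalue()
```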

## Modules Overview

### Summary Generation

1. **Information Extraction**: Extracts critical information from the dataset (see the sketch after this list).
2. **Prompting Gemini**: Constructs a detailed prompt for Gemini to generate summaries and questions.
3. **Summary and Question Generation**: Generates a summary and potential questions for user review.
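
As a concrete illustration of step 1, the kind of "critical information" a pandas DataFrame readily provides could be gathered like this. The field names are assumptions, not the module's actual output format.

```python
import pandas as pd

def extract_dataset_info(df: pd.DataFrame) -> dict:
    """Illustrative step 1: collect the facts that later feed the Gemini prompt."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_values": df.isna().sum().to_dict(),
        "numeric_summary": df.describe().to_dict(),
        "sample_rows": df.head(3).to_dict(orient="records"),
    }
```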

### Data Exploration

Includes tools for EDA, data cleaning, and data transformation.
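
The README does not list the specific EDA calls, but a minimal sketch of the statistical summaries, distribution plots, and correlation matrices mentioned under Features might be:

```python
import pandas as pd
import matplotlib.pyplot as plt

def basic_eda(df: pd.DataFrame) -> None:
    """Illustrative EDA pass: summary stats, distributions, correlation matrix."""
    numeric = df.select_dtypes(include="number")
    print(numeric.describe())                # statistical summary
    numeric.hist(bins=30, figsize=(10, 6))   # distribution plots
    plt.tight_layout()
    fig, ax = plt.subplots()
    im = ax.matshow(numeric.corr())          # correlation matrix
    fig.colorbar(im)
    ax.set_xticks(range(len(numeric.columns)))
    ax.set_xticklabels(numeric.columns, rotation=90)
    ax.set_yticks(range(len(numeric.columns)))
    ax.set_yticklabels(numeric.columns)
    plt.show()
```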

### ML Toolkit

Facilitates the creation and evaluation of machine learning models on the dataset.
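
The model selection and tuning logic is not spelled out here; one plausible shape for it, using scikit-learn (an assumption, not confirmed by this README), is:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

def train_baseline(df: pd.DataFrame, target: str) -> float:
    """Illustrative: split the data, tune a simple model, report test accuracy."""
    X = df.drop(columns=[target]).select_dtypes(include="number")
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
        cv=5,
    )
    search.fit(X_train, y_train)
    return accuracy_score(y_test, search.best_estimator_.predict(X_test))
```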

### QA Module

Allows users to query the dataset and receive answers along with visualizations. The process, sketched in code after this list, involves:

1. Accepting user queries.
2. Combining queries with dataset information.
3. Generating and executing Python code offline.
4. Producing visualizations and textual data.
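
A minimal Streamlit sketch tying these steps together; the app's real widget layout may differ, and it reuses the hypothetical `generate_answer_code` and `run_generated_code` helpers sketched in the earlier sections.

```python
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt

uploaded = st.file_uploader("Upload a CSV dataset", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded)
    query = st.text_input("Ask a question about your data")
    if query:
        code = generate_answer_code(df, query)  # hypothetical helper (see QA Module)
        output = run_generated_code(code, df)   # hypothetical helper (see Code Execution)
        st.code(code, language="python")
        st.text(output)
        st.pyplot(plt.gcf())                    # show any matplotlib figure the code produced
```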

### Analysis Generation

Processes the output from code execution to create concise and insightful responses.
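
A hedged sketch of this final step: feed the raw execution output back to Gemini-Pro and ask for a concise explanation. The prompt wording and function name are illustrative.

```python
import google.generativeai as genai

def explain_results(query: str, raw_output: str) -> str:
    """Illustrative: turn raw execution output into a readable analysis."""
    prompt = (
        "A user asked the following question about their dataset:\n"
        f"{query}\n\n"
        "Running the generated analysis code produced this output:\n"
        f"{raw_output}\n\n"
        "Explain the result in two or three clear sentences."
    )
    model = genai.GenerativeModel("gemini-pro")
    return model.generate_content(prompt).text
```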

## Usage

1. Initialize the Tool: `streamlit run app.py`
2. Load Dataset: Upload your dataset when prompted.
3. Generate Summary: The tool will automatically generate a summary and potential questions.
4. Exploratory Data Analysis: Use the EDA tools to explore your dataset.
5. Query the Dataset: Enter your queries to receive answers and visualizations.
6. Analyze Results: Review the detailed analysis generated by the tool.

## Installation

1. Install the required packages:
   The project's dependencies are listed in the `requirements.txt` file. Install all of them with pip:
   ```
   pip install -r requirements.txt
   ```
2. Run the application:
   Start the Streamlit server with:
   ```
   streamlit run app.py
   ```

## Running with Docker

1. Build the Docker image:
   ```
   docker build -t insights .
   ```
2. Run the Docker container, providing your Google API key:
   ```
   docker run -p 8501:8501 -e GOOGLE_API_KEY=<your-api-key> insights
   ```