Data-Science-Agent / README.md
Pulastya B
Fixed all output path issues
2cf9e11
metadata
title: Data Science Agent
emoji: ๐Ÿค–
colorFrom: blue
colorTo: purple
sdk: docker
python_version: '3.12'
app_file: src/api/app.py
pinned: false
license: mit

Data Science Agent ๐Ÿค–

An intelligent multi-agent AI system for automated end-to-end data science workflows. Upload any dataset and watch the agent autonomously profile, clean, engineer features, train models, and generate insightsโ€”all through natural language.

โœจ Key Features

๐Ÿง  Multi-Agent Architecture

  • 5 Specialist Agents: EDA, ML Modeling, Data Engineering, Visualization, Business Insights
  • Semantic Routing: SBERT-powered agent selection based on query intent
  • Autonomous Workflows: Full ML pipeline completion without manual intervention

๐Ÿ“Š Complete ML Pipeline

  • Data Profiling: YData profiling, statistical analysis, data quality reports
  • Data Cleaning: Missing values, outliers, type conversion, deduplication
  • Feature Engineering: 50+ feature types (time, interactions, aggregations, encodings)
  • Model Training: 6 baseline models (Ridge, Lasso, Random Forest, XGBoost, LightGBM, CatBoost)
  • Hyperparameter Tuning: Optuna-based optimization with early stopping
  • Visualizations: Plotly dashboards, matplotlib plots, feature importance, residuals

๐Ÿ”ง Production-Ready Features

  • Real-time Progress: SSE streaming for live workflow updates
  • Session Memory: Maintains context across follow-up queries
  • Error Recovery: Graceful fallbacks and parameter validation
  • Large Dataset Support: Automatic sampling for 100K+ row datasets
  • HuggingFace Export: Export datasets, models, and outputs directly to your HuggingFace repos

๐Ÿ” Authentication & Integration

  • Supabase Auth: Secure user authentication with email/password and OAuth
  • HuggingFace Integration: Connect your HF account to export artifacts
  • Personal Token Support: Use your own HF write tokens for private uploads

๐Ÿ—๏ธ Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                     React Frontend                          โ”‚
โ”‚              (Upload Dataset + Chat Interface)              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                          โ”‚ SSE Stream
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    FastAPI Server                           โ”‚
โ”‚                    (Port 7860)                              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                          โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                     Orchestrator                            โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚   Intent    โ”‚  โ”‚   Agent     โ”‚  โ”‚    Conversation     โ”‚  โ”‚
โ”‚  โ”‚  Detection  โ”‚โ”€โ”€โ”‚  Selection  โ”‚โ”€โ”€โ”‚      Pruning        โ”‚  โ”‚
โ”‚  โ”‚             โ”‚  โ”‚   (SBERT)   โ”‚  โ”‚  (12 exchanges)     โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                          โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                  5 Specialist Agents                        โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚    EDA    โ”‚ โ”‚ Modeling  โ”‚ โ”‚   Data    โ”‚ โ”‚   Viz     โ”‚   โ”‚
โ”‚  โ”‚   Agent   โ”‚ โ”‚  Agent    โ”‚ โ”‚Engineeringโ”‚ โ”‚  Agent    โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ”‚                      โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                          โ”‚
โ”‚                      โ”‚ Insights  โ”‚                          โ”‚
โ”‚                      โ”‚   Agent   โ”‚                          โ”‚
โ”‚                      โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                          โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    50+ Tools                                โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚ Data Profiling โ”‚ Feature Engineering โ”‚ Model Trainingโ”‚   โ”‚
โ”‚  โ”‚ Data Cleaning  โ”‚ Visualizations      โ”‚ NLP Analytics โ”‚   โ”‚
โ”‚  โ”‚ Time Series    โ”‚ Computer Vision     โ”‚ Business Intelโ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿš€ Quick Start

Usage

  1. Upload your CSV/Excel/Parquet dataset
  2. Ask in natural language: "Analyze this dataset and predict the target column"
  3. Watch the agent autonomously execute the full ML pipeline
  4. Review generated visualizations, model metrics, and insights

Example Queries

"Profile this dataset and show data quality issues"
"Train models to predict the 'price' column"
"Generate feature importance visualizations"
"What are the key insights from this analysis?"

HuggingFace Export

  1. Connect your HuggingFace account via Settings โ†’ Add your HF token
  2. Generate artifacts (datasets, models, visualizations)
  3. Export directly to your HuggingFace repos from the Assets sidebar
  4. Share your work with the ML community

๐Ÿ› ๏ธ Tech Stack

Component Technology
LLM Provider Mistral (mistral-large-latest) / Gemini / Groq
Backend FastAPI + Python 3.12
Frontend React 19 + TypeScript + Vite + Tailwind
Data Processing Polars (primary) + Pandas (XGBoost compatibility)
ML Libraries Scikit-learn, XGBoost, LightGBM, CatBoost
Hyperparameter Tuning Optuna with MedianPruner
Semantic Search Sentence-BERT (all-MiniLM-L6-v2)
Streaming Server-Sent Events (SSE)
Authentication Supabase Auth
Cloud Storage HuggingFace Hub API

๐Ÿ“ Project Structure

src/
โ”œโ”€โ”€ api/
โ”‚   โ””โ”€โ”€ app.py              # FastAPI endpoints + SSE streaming
โ”œโ”€โ”€ orchestrator.py         # Main workflow orchestration (4500+ lines)
โ”œโ”€โ”€ session_memory.py       # Context persistence across queries
โ”œโ”€โ”€ session_store.py        # Session database management
โ”œโ”€โ”€ storage/
โ”‚   โ”œโ”€โ”€ huggingface_storage.py  # HuggingFace Hub integration
โ”‚   โ””โ”€โ”€ artifact_store.py       # Local artifact management
โ”œโ”€โ”€ tools/
โ”‚   โ”œโ”€โ”€ data_profiling.py   # YData profiling, statistics
โ”‚   โ”œโ”€โ”€ data_cleaning.py    # Missing values, outliers
โ”‚   โ”œโ”€โ”€ feature_engineering.py  # 50+ feature types
โ”‚   โ”œโ”€โ”€ model_training.py   # 6 baseline models + progress logging
โ”‚   โ”œโ”€โ”€ advanced_training.py    # Optuna hyperparameter tuning
โ”‚   โ”œโ”€โ”€ plotly_visualizations.py
โ”‚   โ”œโ”€โ”€ matplotlib_visualizations.py
โ”‚   โ””โ”€โ”€ tools_registry.py   # Tool definitions for LLM
โ”œโ”€โ”€ reasoning/
โ”‚   โ”œโ”€โ”€ business_summary.py # Executive summaries
โ”‚   โ””โ”€โ”€ model_explanation.py    # Model interpretation
โ””โ”€โ”€ utils/
    โ”œโ”€โ”€ semantic_layer.py   # SBERT embeddings
    โ””โ”€โ”€ error_recovery.py   # Checkpoint management

โš™๏ธ Configuration

Environment Variables

# Required - Choose one LLM provider
MISTRAL_API_KEY=your_mistral_key      # Recommended
GEMINI_API_KEY=your_gemini_key        # Alternative
GROQ_API_KEY=your_groq_key            # Alternative

# Optional
LLM_PROVIDER=mistral                  # mistral, gemini, or groq
MAX_ITERATIONS=20                     # Max workflow steps

# Supabase (for authentication)
SUPABASE_URL=your_supabase_url
SUPABASE_ANON_KEY=your_supabase_anon_key

HuggingFace Spaces

Set secrets in: Settings โ†’ Repository secrets

๐Ÿ–ฅ๏ธ Local Development

# Clone repository
git clone https://github.com/your-repo/data-science-agent
cd data-science-agent

# Install Python dependencies
pip install -r requirements.txt

# Install and build frontend
cd FRRONTEEEND && npm install && npm run build && cd ..

# Set API key
export MISTRAL_API_KEY=your_key_here

# Run server
uvicorn src.api.app:app --host 0.0.0.0 --port 7860

๐Ÿ“Š Model Training Details

Baseline Models (Regression)

Model Type Key Features
Ridge Linear L2 regularization, fast
Lasso Linear L1 regularization, feature selection
Random Forest Ensemble Robust, feature importance
XGBoost Gradient Boosting High accuracy, GPU support
LightGBM Gradient Boosting Fast training, low memory
CatBoost Gradient Boosting Handles categoricals natively

Progress Logging

Real-time training progress with elapsed time:

๐Ÿš€ Training 6 regression models on 140,757 samples...
[1/6] Training ridge... โœ“ ridge trained in 2.3s
[2/6] Training lasso... โœ“ lasso trained in 1.8s
[3/6] Training random_forest... โœ“ random_forest trained in 45.2s
...
๐Ÿ† Best model: random_forest (Rยฒ=0.7585)

๐Ÿ”ง Recent Improvements

Workflow Reliability

  • โœ… Autonomous Completion: Full ML pipeline without manual confirmation
  • โœ… Smart Context Pruning: Keeps 12 exchanges (was 4) for better memory
  • โœ… Target Column Persistence: Injected into workflow guidance after pruning
  • โœ… Parameter Validation: Strips invalid LLM-hallucinated parameters

Performance

  • โœ… Real-time Progress Logging: See model-by-model training status
  • โœ… Large Dataset Sampling: Auto-sample to 50K rows for tuning
  • โœ… Checkpoint Clearing: Fresh workflow for each new query

Error Handling

  • โœ… SBERT Fallback: Graceful keyword routing if embeddings fail
  • โœ… Tool Name Mapping: Maps 8+ common hallucinated tool names
  • โœ… NoneType Safety: Validates all comparison operands

HuggingFace Integration

  • โœ… One-Click Export: Export datasets, models, and outputs to HuggingFace
  • โœ… Personal Repos: Auto-creates ds-agent-data, ds-agent-models, ds-agent-outputs repos
  • โœ… Secure Tokens: User tokens stored securely in Supabase
  • โœ… Status Caching: Efficient HF connection status checking

๐Ÿณ Docker Deployment

# Multi-stage build
FROM node:20-slim AS frontend
# Build React frontend

FROM python:3.12-slim AS backend
# Install Python dependencies + copy frontend build
EXPOSE 7860
CMD ["uvicorn", "src.api.app:app", "--host", "0.0.0.0", "--port", "7860"]

๐Ÿ“ˆ Performance Benchmarks

Dataset Size Profiling Training (6 models) Total Workflow
10K rows ~5s ~30s ~2 min
50K rows ~15s ~2 min ~5 min
175K rows ~45s ~5 min ~10 min

๐Ÿ”ฎ Future Enhancements

We're actively working on exciting new features to make the Data Science Agent even more powerful:

๐Ÿ—„๏ธ BigQuery Integration

  • Direct BigQuery Connection: Query and analyze massive datasets directly from Google BigQuery
  • Smart Sampling: Intelligent sampling strategies for billion-row tables
  • Cost Optimization: Query cost estimation before execution
  • Schema Discovery: Auto-detect tables, columns, and relationships

๐Ÿ”— LangChain / LlamaIndex Compatibility

  • Framework Agnostic: Use as a tool within LangChain agents or LlamaIndex pipelines
  • Custom Tool Registration: Expose 50+ data science tools as LangChain tools
  • RAG Integration: Combine with document retrieval for context-aware analysis
  • Memory Backends: Support for LangChain memory stores and conversation history

๐Ÿ’ป First-Class CLI Experience & Beautiful TUI

  • Rich Terminal UI: Interactive dashboards with progress bars, tables, and charts
  • Keyboard Navigation: Full workflow control without leaving the terminal
  • Pipeline Scripting: Define reproducible workflows in YAML/TOML
  • Offline Mode: Run locally without requiring a browser
  • SSH-Friendly: Perfect for remote server analysis

๐Ÿค Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request

๐Ÿ“„ License

MIT License - see LICENSE file for details.


Built with โค๏ธ for autonomous data science