Spaces:

Pulastya0
/

Data-Science-Agent

Running

App Files Files Community

Data-Science-Agent / README.md

Pulastya B

Fixed all output path issues

2cf9e11 about 1 month ago

preview code

raw

history blame contribute delete

14.3 kB

metadata

title: Data Science Agent
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
python_version: '3.12'
app_file: src/api/app.py
pinned: false
license: mit

Data Science Agent 🤖

An intelligent multi-agent AI system for automated end-to-end data science workflows. Upload any dataset and watch the agent autonomously profile, clean, engineer features, train models, and generate insights—all through natural language.

✨ Key Features

🧠 Multi-Agent Architecture

5 Specialist Agents: EDA, ML Modeling, Data Engineering, Visualization, Business Insights
Semantic Routing: SBERT-powered agent selection based on query intent
Autonomous Workflows: Full ML pipeline completion without manual intervention

📊 Complete ML Pipeline

Data Profiling: YData profiling, statistical analysis, data quality reports
Data Cleaning: Missing values, outliers, type conversion, deduplication
Feature Engineering: 50+ feature types (time, interactions, aggregations, encodings)
Model Training: 6 baseline models (Ridge, Lasso, Random Forest, XGBoost, LightGBM, CatBoost)
Hyperparameter Tuning: Optuna-based optimization with early stopping
Visualizations: Plotly dashboards, matplotlib plots, feature importance, residuals

🔧 Production-Ready Features

Real-time Progress: SSE streaming for live workflow updates
Session Memory: Maintains context across follow-up queries
Error Recovery: Graceful fallbacks and parameter validation
Large Dataset Support: Automatic sampling for 100K+ row datasets
HuggingFace Export: Export datasets, models, and outputs directly to your HuggingFace repos

🔐 Authentication & Integration

Supabase Auth: Secure user authentication with email/password and OAuth
HuggingFace Integration: Connect your HF account to export artifacts
Personal Token Support: Use your own HF write tokens for private uploads

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                     React Frontend                          │
│              (Upload Dataset + Chat Interface)              │
└─────────────────────────┬───────────────────────────────────┘
                          │ SSE Stream
┌─────────────────────────▼───────────────────────────────────┐
│                    FastAPI Server                           │
│                    (Port 7860)                              │
└─────────────────────────┬───────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────┐
│                     Orchestrator                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   Intent    │  │   Agent     │  │    Conversation     │  │
│  │  Detection  │──│  Selection  │──│      Pruning        │  │
│  │             │  │   (SBERT)   │  │  (12 exchanges)     │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
└─────────────────────────┬───────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────┐
│                  5 Specialist Agents                        │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐   │
│  │    EDA    │ │ Modeling  │ │   Data    │ │   Viz     │   │
│  │   Agent   │ │  Agent    │ │Engineering│ │  Agent    │   │
│  └───────────┘ └───────────┘ └───────────┘ └───────────┘   │
│                      ┌───────────┐                          │
│                      │ Insights  │                          │
│                      │   Agent   │                          │
│                      └───────────┘                          │
└─────────────────────────┬───────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────┐
│                    50+ Tools                                │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Data Profiling │ Feature Engineering │ Model Training│   │
│  │ Data Cleaning  │ Visualizations      │ NLP Analytics │   │
│  │ Time Series    │ Computer Vision     │ Business Intel│   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

🚀 Quick Start

Usage

Upload your CSV/Excel/Parquet dataset
Ask in natural language: "Analyze this dataset and predict the target column"
Watch the agent autonomously execute the full ML pipeline
Review generated visualizations, model metrics, and insights

Example Queries

"Profile this dataset and show data quality issues"
"Train models to predict the 'price' column"
"Generate feature importance visualizations"
"What are the key insights from this analysis?"

HuggingFace Export

Connect your HuggingFace account via Settings → Add your HF token
Generate artifacts (datasets, models, visualizations)
Export directly to your HuggingFace repos from the Assets sidebar
Share your work with the ML community

🛠️ Tech Stack

Component	Technology
LLM Provider	Mistral (mistral-large-latest) / Gemini / Groq
Backend	FastAPI + Python 3.12
Frontend	React 19 + TypeScript + Vite + Tailwind
Data Processing	Polars (primary) + Pandas (XGBoost compatibility)
ML Libraries	Scikit-learn, XGBoost, LightGBM, CatBoost
Hyperparameter Tuning	Optuna with MedianPruner
Semantic Search	Sentence-BERT (all-MiniLM-L6-v2)
Streaming	Server-Sent Events (SSE)
Authentication	Supabase Auth
Cloud Storage	HuggingFace Hub API

📁 Project Structure

src/
├── api/
│   └── app.py              # FastAPI endpoints + SSE streaming
├── orchestrator.py         # Main workflow orchestration (4500+ lines)
├── session_memory.py       # Context persistence across queries
├── session_store.py        # Session database management
├── storage/
│   ├── huggingface_storage.py  # HuggingFace Hub integration
│   └── artifact_store.py       # Local artifact management
├── tools/
│   ├── data_profiling.py   # YData profiling, statistics
│   ├── data_cleaning.py    # Missing values, outliers
│   ├── feature_engineering.py  # 50+ feature types
│   ├── model_training.py   # 6 baseline models + progress logging
│   ├── advanced_training.py    # Optuna hyperparameter tuning
│   ├── plotly_visualizations.py
│   ├── matplotlib_visualizations.py
│   └── tools_registry.py   # Tool definitions for LLM
├── reasoning/
│   ├── business_summary.py # Executive summaries
│   └── model_explanation.py    # Model interpretation
└── utils/
    ├── semantic_layer.py   # SBERT embeddings
    └── error_recovery.py   # Checkpoint management

⚙️ Configuration

Environment Variables

# Required - Choose one LLM provider
MISTRAL_API_KEY=your_mistral_key      # Recommended
GEMINI_API_KEY=your_gemini_key        # Alternative
GROQ_API_KEY=your_groq_key            # Alternative

# Optional
LLM_PROVIDER=mistral                  # mistral, gemini, or groq
MAX_ITERATIONS=20                     # Max workflow steps

# Supabase (for authentication)
SUPABASE_URL=your_supabase_url
SUPABASE_ANON_KEY=your_supabase_anon_key

HuggingFace Spaces

Set secrets in: Settings → Repository secrets

🖥️ Local Development

# Clone repository
git clone https://github.com/your-repo/data-science-agent
cd data-science-agent

# Install Python dependencies
pip install -r requirements.txt

# Install and build frontend
cd FRRONTEEEND && npm install && npm run build && cd ..

# Set API key
export MISTRAL_API_KEY=your_key_here

# Run server
uvicorn src.api.app:app --host 0.0.0.0 --port 7860

📊 Model Training Details

Baseline Models (Regression)

Model	Type	Key Features
Ridge	Linear	L2 regularization, fast
Lasso	Linear	L1 regularization, feature selection
Random Forest	Ensemble	Robust, feature importance
XGBoost	Gradient Boosting	High accuracy, GPU support
LightGBM	Gradient Boosting	Fast training, low memory
CatBoost	Gradient Boosting	Handles categoricals natively

Progress Logging

Real-time training progress with elapsed time:

🚀 Training 6 regression models on 140,757 samples...
[1/6] Training ridge... ✓ ridge trained in 2.3s
[2/6] Training lasso... ✓ lasso trained in 1.8s
[3/6] Training random_forest... ✓ random_forest trained in 45.2s
...
🏆 Best model: random_forest (R²=0.7585)

🔧 Recent Improvements

Workflow Reliability

✅ Autonomous Completion: Full ML pipeline without manual confirmation
✅ Smart Context Pruning: Keeps 12 exchanges (was 4) for better memory
✅ Target Column Persistence: Injected into workflow guidance after pruning
✅ Parameter Validation: Strips invalid LLM-hallucinated parameters

Performance

✅ Real-time Progress Logging: See model-by-model training status
✅ Large Dataset Sampling: Auto-sample to 50K rows for tuning
✅ Checkpoint Clearing: Fresh workflow for each new query

Error Handling

✅ SBERT Fallback: Graceful keyword routing if embeddings fail
✅ Tool Name Mapping: Maps 8+ common hallucinated tool names
✅ NoneType Safety: Validates all comparison operands

HuggingFace Integration

✅ One-Click Export: Export datasets, models, and outputs to HuggingFace
✅ Personal Repos: Auto-creates ds-agent-data, ds-agent-models, ds-agent-outputs repos
✅ Secure Tokens: User tokens stored securely in Supabase
✅ Status Caching: Efficient HF connection status checking

🐳 Docker Deployment

# Multi-stage build
FROM node:20-slim AS frontend
# Build React frontend

FROM python:3.12-slim AS backend
# Install Python dependencies + copy frontend build
EXPOSE 7860
CMD ["uvicorn", "src.api.app:app", "--host", "0.0.0.0", "--port", "7860"]

📈 Performance Benchmarks

Dataset Size	Profiling	Training (6 models)	Total Workflow
10K rows	~5s	~30s	~2 min
50K rows	~15s	~2 min	~5 min
175K rows	~45s	~5 min	~10 min

🔮 Future Enhancements

We're actively working on exciting new features to make the Data Science Agent even more powerful:

🗄️ BigQuery Integration

Direct BigQuery Connection: Query and analyze massive datasets directly from Google BigQuery
Smart Sampling: Intelligent sampling strategies for billion-row tables
Cost Optimization: Query cost estimation before execution
Schema Discovery: Auto-detect tables, columns, and relationships

🔗 LangChain / LlamaIndex Compatibility

Framework Agnostic: Use as a tool within LangChain agents or LlamaIndex pipelines
Custom Tool Registration: Expose 50+ data science tools as LangChain tools
RAG Integration: Combine with document retrieval for context-aware analysis
Memory Backends: Support for LangChain memory stores and conversation history

💻 First-Class CLI Experience & Beautiful TUI

Rich Terminal UI: Interactive dashboards with progress bars, tables, and charts
Keyboard Navigation: Full workflow control without leaving the terminal
Pipeline Scripting: Define reproducible workflows in YAML/TOML
Offline Mode: Run locally without requiring a browser
SSH-Friendly: Perfect for remote server analysis

🤝 Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Submit a pull request

📄 License

MIT License - see LICENSE file for details.

Built with ❤️ for autonomous data science