Spaces:
Running
Running
metadata
title: Data Science Agent
emoji: ๐ค
colorFrom: blue
colorTo: purple
sdk: docker
python_version: '3.12'
app_file: src/api/app.py
pinned: false
license: mit
Data Science Agent ๐ค
An intelligent multi-agent AI system for automated end-to-end data science workflows. Upload any dataset and watch the agent autonomously profile, clean, engineer features, train models, and generate insightsโall through natural language.
โจ Key Features
๐ง Multi-Agent Architecture
- 5 Specialist Agents: EDA, ML Modeling, Data Engineering, Visualization, Business Insights
- Semantic Routing: SBERT-powered agent selection based on query intent
- Autonomous Workflows: Full ML pipeline completion without manual intervention
๐ Complete ML Pipeline
- Data Profiling: YData profiling, statistical analysis, data quality reports
- Data Cleaning: Missing values, outliers, type conversion, deduplication
- Feature Engineering: 50+ feature types (time, interactions, aggregations, encodings)
- Model Training: 6 baseline models (Ridge, Lasso, Random Forest, XGBoost, LightGBM, CatBoost)
- Hyperparameter Tuning: Optuna-based optimization with early stopping
- Visualizations: Plotly dashboards, matplotlib plots, feature importance, residuals
๐ง Production-Ready Features
- Real-time Progress: SSE streaming for live workflow updates
- Session Memory: Maintains context across follow-up queries
- Error Recovery: Graceful fallbacks and parameter validation
- Large Dataset Support: Automatic sampling for 100K+ row datasets
- HuggingFace Export: Export datasets, models, and outputs directly to your HuggingFace repos
๐ Authentication & Integration
- Supabase Auth: Secure user authentication with email/password and OAuth
- HuggingFace Integration: Connect your HF account to export artifacts
- Personal Token Support: Use your own HF write tokens for private uploads
๐๏ธ Architecture
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ React Frontend โ
โ (Upload Dataset + Chat Interface) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ SSE Stream
โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ FastAPI Server โ
โ (Port 7860) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Orchestrator โ
โ โโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Intent โ โ Agent โ โ Conversation โ โ
โ โ Detection โโโโ Selection โโโโ Pruning โ โ
โ โ โ โ (SBERT) โ โ (12 exchanges) โ โ
โ โโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ 5 Specialist Agents โ
โ โโโโโโโโโโโโโ โโโโโโโโโโโโโ โโโโโโโโโโโโโ โโโโโโโโโโโโโ โ
โ โ EDA โ โ Modeling โ โ Data โ โ Viz โ โ
โ โ Agent โ โ Agent โ โEngineeringโ โ Agent โ โ
โ โโโโโโโโโโโโโ โโโโโโโโโโโโโ โโโโโโโโโโโโโ โโโโโโโโโโโโโ โ
โ โโโโโโโโโโโโโ โ
โ โ Insights โ โ
โ โ Agent โ โ
โ โโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ 50+ Tools โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Data Profiling โ Feature Engineering โ Model Trainingโ โ
โ โ Data Cleaning โ Visualizations โ NLP Analytics โ โ
โ โ Time Series โ Computer Vision โ Business Intelโ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ Quick Start
Usage
- Upload your CSV/Excel/Parquet dataset
- Ask in natural language: "Analyze this dataset and predict the target column"
- Watch the agent autonomously execute the full ML pipeline
- Review generated visualizations, model metrics, and insights
Example Queries
"Profile this dataset and show data quality issues"
"Train models to predict the 'price' column"
"Generate feature importance visualizations"
"What are the key insights from this analysis?"
HuggingFace Export
- Connect your HuggingFace account via Settings โ Add your HF token
- Generate artifacts (datasets, models, visualizations)
- Export directly to your HuggingFace repos from the Assets sidebar
- Share your work with the ML community
๐ ๏ธ Tech Stack
| Component | Technology |
|---|---|
| LLM Provider | Mistral (mistral-large-latest) / Gemini / Groq |
| Backend | FastAPI + Python 3.12 |
| Frontend | React 19 + TypeScript + Vite + Tailwind |
| Data Processing | Polars (primary) + Pandas (XGBoost compatibility) |
| ML Libraries | Scikit-learn, XGBoost, LightGBM, CatBoost |
| Hyperparameter Tuning | Optuna with MedianPruner |
| Semantic Search | Sentence-BERT (all-MiniLM-L6-v2) |
| Streaming | Server-Sent Events (SSE) |
| Authentication | Supabase Auth |
| Cloud Storage | HuggingFace Hub API |
๐ Project Structure
src/
โโโ api/
โ โโโ app.py # FastAPI endpoints + SSE streaming
โโโ orchestrator.py # Main workflow orchestration (4500+ lines)
โโโ session_memory.py # Context persistence across queries
โโโ session_store.py # Session database management
โโโ storage/
โ โโโ huggingface_storage.py # HuggingFace Hub integration
โ โโโ artifact_store.py # Local artifact management
โโโ tools/
โ โโโ data_profiling.py # YData profiling, statistics
โ โโโ data_cleaning.py # Missing values, outliers
โ โโโ feature_engineering.py # 50+ feature types
โ โโโ model_training.py # 6 baseline models + progress logging
โ โโโ advanced_training.py # Optuna hyperparameter tuning
โ โโโ plotly_visualizations.py
โ โโโ matplotlib_visualizations.py
โ โโโ tools_registry.py # Tool definitions for LLM
โโโ reasoning/
โ โโโ business_summary.py # Executive summaries
โ โโโ model_explanation.py # Model interpretation
โโโ utils/
โโโ semantic_layer.py # SBERT embeddings
โโโ error_recovery.py # Checkpoint management
โ๏ธ Configuration
Environment Variables
# Required - Choose one LLM provider
MISTRAL_API_KEY=your_mistral_key # Recommended
GEMINI_API_KEY=your_gemini_key # Alternative
GROQ_API_KEY=your_groq_key # Alternative
# Optional
LLM_PROVIDER=mistral # mistral, gemini, or groq
MAX_ITERATIONS=20 # Max workflow steps
# Supabase (for authentication)
SUPABASE_URL=your_supabase_url
SUPABASE_ANON_KEY=your_supabase_anon_key
HuggingFace Spaces
Set secrets in: Settings โ Repository secrets
๐ฅ๏ธ Local Development
# Clone repository
git clone https://github.com/your-repo/data-science-agent
cd data-science-agent
# Install Python dependencies
pip install -r requirements.txt
# Install and build frontend
cd FRRONTEEEND && npm install && npm run build && cd ..
# Set API key
export MISTRAL_API_KEY=your_key_here
# Run server
uvicorn src.api.app:app --host 0.0.0.0 --port 7860
๐ Model Training Details
Baseline Models (Regression)
| Model | Type | Key Features |
|---|---|---|
| Ridge | Linear | L2 regularization, fast |
| Lasso | Linear | L1 regularization, feature selection |
| Random Forest | Ensemble | Robust, feature importance |
| XGBoost | Gradient Boosting | High accuracy, GPU support |
| LightGBM | Gradient Boosting | Fast training, low memory |
| CatBoost | Gradient Boosting | Handles categoricals natively |
Progress Logging
Real-time training progress with elapsed time:
๐ Training 6 regression models on 140,757 samples...
[1/6] Training ridge... โ ridge trained in 2.3s
[2/6] Training lasso... โ lasso trained in 1.8s
[3/6] Training random_forest... โ random_forest trained in 45.2s
...
๐ Best model: random_forest (Rยฒ=0.7585)
๐ง Recent Improvements
Workflow Reliability
- โ Autonomous Completion: Full ML pipeline without manual confirmation
- โ Smart Context Pruning: Keeps 12 exchanges (was 4) for better memory
- โ Target Column Persistence: Injected into workflow guidance after pruning
- โ Parameter Validation: Strips invalid LLM-hallucinated parameters
Performance
- โ Real-time Progress Logging: See model-by-model training status
- โ Large Dataset Sampling: Auto-sample to 50K rows for tuning
- โ Checkpoint Clearing: Fresh workflow for each new query
Error Handling
- โ SBERT Fallback: Graceful keyword routing if embeddings fail
- โ Tool Name Mapping: Maps 8+ common hallucinated tool names
- โ NoneType Safety: Validates all comparison operands
HuggingFace Integration
- โ One-Click Export: Export datasets, models, and outputs to HuggingFace
- โ
Personal Repos: Auto-creates
ds-agent-data,ds-agent-models,ds-agent-outputsrepos - โ Secure Tokens: User tokens stored securely in Supabase
- โ Status Caching: Efficient HF connection status checking
๐ณ Docker Deployment
# Multi-stage build
FROM node:20-slim AS frontend
# Build React frontend
FROM python:3.12-slim AS backend
# Install Python dependencies + copy frontend build
EXPOSE 7860
CMD ["uvicorn", "src.api.app:app", "--host", "0.0.0.0", "--port", "7860"]
๐ Performance Benchmarks
| Dataset Size | Profiling | Training (6 models) | Total Workflow |
|---|---|---|---|
| 10K rows | ~5s | ~30s | ~2 min |
| 50K rows | ~15s | ~2 min | ~5 min |
| 175K rows | ~45s | ~5 min | ~10 min |
๐ฎ Future Enhancements
We're actively working on exciting new features to make the Data Science Agent even more powerful:
๐๏ธ BigQuery Integration
- Direct BigQuery Connection: Query and analyze massive datasets directly from Google BigQuery
- Smart Sampling: Intelligent sampling strategies for billion-row tables
- Cost Optimization: Query cost estimation before execution
- Schema Discovery: Auto-detect tables, columns, and relationships
๐ LangChain / LlamaIndex Compatibility
- Framework Agnostic: Use as a tool within LangChain agents or LlamaIndex pipelines
- Custom Tool Registration: Expose 50+ data science tools as LangChain tools
- RAG Integration: Combine with document retrieval for context-aware analysis
- Memory Backends: Support for LangChain memory stores and conversation history
๐ป First-Class CLI Experience & Beautiful TUI
- Rich Terminal UI: Interactive dashboards with progress bars, tables, and charts
- Keyboard Navigation: Full workflow control without leaving the terminal
- Pipeline Scripting: Define reproducible workflows in YAML/TOML
- Offline Mode: Run locally without requiring a browser
- SSH-Friendly: Perfect for remote server analysis
๐ค Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
๐ License
MIT License - see LICENSE file for details.
Built with โค๏ธ for autonomous data science