Spaces:
Running
title: NL2SQL Copilot โ Full Stack Demo
emoji: ๐งฉ
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
๐งฉ NL2SQL Copilot โ Natural-Language โ Safe SQL
Modular Text-to-SQL Copilot built with FastAPI & Pydantic-AI. Generates safe, verified, executable SQL through a multi-stage agentic pipeline. Includes: schema introspection, self-repair, Spider benchmarks, Prometheus metrics, and a full demo UI.
๐ Live Demo: ๐ https://huggingface.co/spaces/melika-kheirieh/nl2sql-copilot
1) Quick Start
git clone https://github.com/melika-kheirieh/nl2sql-copilot
cd nl2sql-copilot
make setup # install dependencies
make run # start API + Gradio UI
Open:
- http://localhost:8000 (FastAPI Swagger UI)
- http://localhost:7860 (Gradio Demo)
2) Demo (Gradio UI)
The demo supports:
- Uploading any SQLite database
- Asking natural-language questions
- Viewing generated SQL
- Viewing query results
- Full multi-stage trace (detector โ planner โ generator โ safety โ executor โ verifier โ repair)
- Per-stage timings
- Example queries
- And a default demo DB (no upload required)
Everything runs on the same backend as the API.
3) Agentic Architecture
user query
โ
detector (ambiguity, missing info)
planner (schema reasoning + task decomposition)
generator (SQL generation)
safety (SELECT-only validation)
executor (sandboxed DB execution)
verifier (semantic + execution checks)
repair (minimal-diff SQL repair loop)
โ
final SQL + result + traces
โ๏ธ Tech Stack
- FastAPI
- Pydantic-AI
- SQLiteAdapter
- Prometheus + Grafana
- pytest + mypy + Makefile
- Gradio UI
The pipeline is fully modular: each stage has a clean, swappable interface.
4) Evolution (Prototype โ Copilot)
This project is the second-generation, production-grade version of an earlier prototype: ๐ https://github.com/melika-kheirieh/nl2sql-copilot-prototype
The prototype explored single-step, prompt-based SQL generation. The current version is a complete architectural redesign, adding:
- multi-stage agentic pipeline
- schema introspection
- safety guardrails
- self-repair loop
- caching
- observability
- Spider benchmarks
- multi-DB support with upload + TTL handling
This repository is the first end-to-end, production-oriented version.
5) Key Features
โ Agentic Pipeline
Planner โ Generator โ Safety โ Executor โ Verifier โ Repair.
โ Schema-Aware
Automatic schema preview for any uploaded SQLite database.
โ Safety by Design
- SELECT-only
- Column/table validation
- No multi-statement SQL
- Prevents schema hallucination
โ Self-Repair
Automatic minimal-diff correction when SQL fails.
โ Caching
TTL-based, with key = (db_id, normalized_query, schema_hash). Hit/miss metrics included.
โ Observability
- Per-stage latency
- Pipeline success ratio
- Repair success rate
- Cache hit ratio
- p95 latency
- Full Grafana dashboard
โ Benchmarks
Reproducible Spider evaluation with plots + summary.
6) Benchmarks (Spider dev, 20 samples)
Evaluated on a curated 20-sample subset of the Spider dev split
(focused on concert_singer), using the full production pipeline.
๐งฎ Summary
- Total samples: 20
- Successful runs: 20/20 (100%)
- Exact Match (EM): 0.10
- Structural Match (SM): 0.70
- Execution Accuracy: 0.725
This reflects a production-oriented NL2SQL system: the model optimizes for executable SQL, not literal gold-string alignment.
โฑ Latency
- Avg latency: ~8066 ms
- p50: ~9229 ms
- p95: ~14936 ms
Latency is bimodal: simple queries โ fast, reasoning-heavy queries โ planner-dominated.
โ๏ธ Per-Stage Latency
| Stage | Avg latency (ms) |
|---|---|
| detector | ~1 |
| planner | ~8360 |
| generator | ~1645 |
| safety | ~2 |
| executor | ~1 |
| verifier | ~1 |
| repair | ~1200 |
Planner is the main bottleneck (expected for schema-level reasoning). Safety/executor/verifier stay single-digit ms.
โ Failure Modes (Why EM is low)
Even when EM = 0, SM and ExecAcc are often 1.0.
Typical causes:
- Capitalization differences (
Agevsage) - Different column ordering
- LIMIT differences
- Alias mismatch
- Gold SQL is
EMPTYbut the model infers a valid SQL
In real-world systems, execution correctness matters more than exact string match.
๐ Reproducing the Benchmark
export SPIDER_ROOT="$PWD/data/spider"
PYTHONPATH=$PWD \
python benchmarks/evaluate_spider_pro.py --spider --split dev --limit 20 --debug
PYTHONPATH=$PWD \
python benchmarks/plot_results.py
Artifacts saved under:
benchmarks/results_pro/<timestamp>/
summary.json
eval.jsonl
metrics_overview.png
latency_histogram.png
latency_per_stage.png
errors_overview.png
7) API Usage
๐ NL โ SQL
curl -X POST "http://localhost:8000/api/v1/nl2sql" \
-H "Content-Type: application/json" \
-H "X-API-Key: dev-key" \
-d '{
"query": "Top 5 customers by total invoice amount",
"db_id": null
}'
โ Sample Response (accurate)
{
"ambiguous": false,
"sql": "SELECT ...",
"rationale": "Explanation of why this SQL was generated.",
"result": {
"rows": 5,
"columns": ["CustomerId", "Total"],
"rows_data": [
[1, 39.6],
[2, 38.7],
[3, 35.4]
]
},
"traces": [
{"stage": "detector", "duration_ms": 1},
{"stage": "planner", "duration_ms": 8943},
{"stage": "generator","duration_ms": 1722},
{"stage": "safety", "duration_ms": 2},
{"stage": "executor", "duration_ms": 1},
{"stage": "verifier", "duration_ms": 1},
{"stage": "repair", "duration_ms": 522}
]
}
๐ Upload a SQLite DB
curl -X POST "http://localhost:8000/api/v1/nl2sql/upload_db" \
-H "X-API-Key: dev-key" \
-F "file=@/path/to/db.sqlite"
๐ Schema Preview
curl "http://localhost:8000/api/v1/nl2sql/schema?db_id=<uuid>" \
-H "X-API-Key: dev-key"
8) Environment Variables
| Variable | Purpose |
|---|---|
API_KEYS |
Comma-separated list of backend API keys |
API_KEY |
Used by Gradio UI to call the backend |
DEV_MODE |
Enables strict ambiguity detection |
NL2SQL_CACHE_TTL_SEC |
Cache TTL |
NL2SQL_CACHE_MAX |
Max cache entries |
SPIDER_ROOT |
Path to Spider dataset |
USE_MOCK |
Skip execution (for testing) |
Gradio uses
API_KEYโ backend expects it asX-API-Key. Backend accepts multiple keys viaAPI_KEYS.
9) Future Work
1) Streaming SQL Generation (SSE)
2) Redis Distributed Cache
3) Multi-Model Planner/Generator
4) A/B Testing Framework
5) Schema Embeddings
6) Nightly CI Benchmarks
7) Advanced Repair (diff-based)
8) Helm / Compose Deployment Template
10) License
MIT License.