hjerpe commited on
Commit
5dd1bb4
·
verified ·
1 Parent(s): 34d2fe8

Upload folder using huggingface_hub

Browse files
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. AGENTS.md +145 -0
  2. CLAUDE.md +61 -0
  3. Dockerfile +85 -0
  4. GEMINI.md +62 -0
  5. README.md +135 -5
  6. REVIEW_REPORT.md +57 -0
  7. __init__.py +36 -0
  8. client.py +140 -0
  9. conftest.py +3 -0
  10. data/__init__.py +1 -0
  11. data/databases/__init__.py +1 -0
  12. data/databases/models.py +153 -0
  13. data/questions/db_list.json +12 -0
  14. data/questions/questions_eval.json +0 -0
  15. data/questions/questions_train.json +0 -0
  16. data/questions/student_assessment.json +3355 -0
  17. docs/ARCHITECTURE.md +361 -0
  18. docs/README.md +41 -0
  19. docs/RUNBOOK.md +10 -0
  20. docs/blog-outline.md +56 -0
  21. docs/design-docs/decisions/0001-template.md +26 -0
  22. docs/design-docs/index.md +57 -0
  23. docs/guides/README.md +24 -0
  24. docs/learnings/F007-architecture.md +1 -0
  25. docs/learnings/F007-conventions.md +2 -0
  26. docs/learnings/F007-gotchas.md +2 -0
  27. docs/learnings/F007-integrations.md +2 -0
  28. docs/learnings/F007-security.md +1 -0
  29. docs/learnings/F007-testing.md +1 -0
  30. docs/learnings/F007-workflow.md +1 -0
  31. docs/references/README.md +5 -0
  32. evaluation/__init__.py +11 -0
  33. evaluation/green_agent.py +199 -0
  34. models.py +272 -0
  35. notebooks/train_grpo.ipynb +226 -0
  36. opencode.jsonc +283 -0
  37. openenv.yaml +6 -0
  38. progress.log +29 -0
  39. pyproject.toml +69 -0
  40. scripts/curate_questions.py +921 -0
  41. scripts/download_spider_data.py +106 -0
  42. scripts/download_spider_databases.py +301 -0
  43. scripts/generate_models_from_schema.py +294 -0
  44. server/__init__.py +5 -0
  45. server/app.py +110 -0
  46. server/install_deps.sh +12 -0
  47. server/requirements.txt +6 -0
  48. server/reward.py +185 -0
  49. server/sql_environment.py +635 -0
  50. server/synthetic/__init__.py +25 -0
AGENTS.md ADDED
@@ -0,0 +1,145 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Project Map (AGENTS.md)
2
+
3
+ This file is a navigation map for agents. Durable knowledge lives in `docs/`.
4
+
5
+ ## Start Here
6
+
7
+ - Docs index: [docs/README.md](docs/README.md)
8
+ - Architecture: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
9
+ - Operations: [docs/RUNBOOK.md](docs/RUNBOOK.md)
10
+ - Test: `uv run pytest tests/ -v`
11
+
12
+ ## System-of-Record Documents
13
+
14
+ | Category | Location | Type | Purpose |
15
+ |----------|----------|------|---------|
16
+ | Guides | [docs/guides/README.md](docs/guides/README.md) | how-to | Practical procedures |
17
+ | Design docs | [docs/design-docs/index.md](docs/design-docs/index.md) | explanation | Feature design, ADRs |
18
+ | References | [docs/references/README.md](docs/references/README.md) | reference | External docs |
19
+
20
+ ## Project Structure
21
+
22
+ This project follows the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) `openenv init` convention.
23
+ The project root **is** the environment package — no `envs/` nesting.
24
+
25
+ ```
26
+ sql-env/ # project root = environment package
27
+ ├── __init__.py # exports SQLAction, SQLObservation, SQLEnvClient
28
+ ├── models.py # Pydantic models (action w/ tokens, observation w/ messages, state)
29
+ ├── client.py # SQLEnvClient(EnvClient) — WebSocket client w/ tensor serialization
30
+ ├── conftest.py # pytest config (ignores __init__.py collection)
31
+ ├── openenv.yaml # OpenEnv manifest
32
+ ├── pyproject.toml # deps + package config (setuptools, torch, transformers)
33
+ ├── .python-version # pins Python 3.12
34
+ ├── data/
35
+ │ ├── databases/
36
+ │ │ └── models.py # SQLAlchemy ORM models (student_assessment)
37
+ │ └── questions/
38
+ │ └── student_assessment.json # 30+ Spider Q&A pairs with gold SQL
39
+ ├── server/
40
+ │ ├── app.py # FastAPI app (tokenizer factory, MockTokenizer fallback)
41
+ │ ├── sql_environment.py # SQLEnvironment(Environment) — core logic + Ollama
42
+ │ ├── test_sql_env.py # MockTokenizer (char-code encoding for dev/test)
43
+ │ ├── reward.py # Reward computation (stub — Phase 3)
44
+ │ ├── verifier.py # Answer comparison (stub — Phase 3)
45
+ │ ├── Dockerfile
46
+ │ ├── requirements.txt
47
+ │ └── install_deps.sh # Docker setup script
48
+ ├── scripts/
49
+ │ ├── download_spider_data.py # Download Spider questions from HuggingFace
50
+ │ └── generate_models_from_schema.py # Auto-generate SQLAlchemy models
51
+ ├── tests/
52
+ │ └── test_smoke.py # 21 tests (models, env, actions, client, schema)
53
+ ├── docs/ # Design docs, architecture
54
+ └── AGENTS.md
55
+ ```
56
+
57
+ ## Guardrails
58
+
59
+ - **Testing:** Use the package manager (`uv run pytest ...`), never bare `pytest`.
60
+ - **Git safety:** No destructive commands (`reset --hard`, `push --force`) unless explicit.
61
+ - **Secrets:** Never commit `.env` or credentials.
62
+
63
+ ## Quick Commands
64
+
65
+ | Task | Command |
66
+ |------|---------|
67
+ | Install | `uv sync` |
68
+ | Lint | `uv run ruff check --fix .` |
69
+ | Format | `uv run ruff format .` |
70
+ | Test | `uv run pytest tests/ -v` |
71
+ | Run server | `uv run uvicorn server.app:app --reload` |
72
+ | Validate env | `uv run openenv validate --verbose` |
73
+ | Build Docker | `uv run openenv build` |
74
+ | Push to HF | `uv run openenv push` |
75
+
76
+ ## Development Workflow
77
+
78
+ - Run via package manager (`uv run ...`), never bare commands.
79
+ - List existing files before creating new ones (avoid naming drift).
80
+ - Prefer vertical slices over horizontal refactors.
81
+ - No premature abstraction until multiple use-cases require it.
82
+
83
+ <!-- GUIDELINES-BEGIN -->
84
+
85
+ ## Delivery Safety (Move Fast Without Breaking Things)
86
+
87
+ Move fast by taking the smallest responsible step that produces real feedback, while pre-committing to guardrails so being wrong is survivable.
88
+
89
+ - **Small batches:** Prefer vertical slices and small PRs; reduce blast radius and review/debug time.
90
+ - **Define "broken" first:** Before shipping, write down what you will watch (errors, latency, correctness, cost) and the abort threshold.
91
+ - **Design for reversibility:** Make changes easy to turn off, roll back, or ignore.
92
+
93
+ ## System Boundaries (Avoid Analysis Paralysis)
94
+
95
+ Systems are continuous webs; plans require artificial boundaries.
96
+
97
+ - **Boundary rule:** Include only variables/components that could change the decision you are making.
98
+ - **Clouds:** Treat everything else as exogenous inputs; track them as risks/assumptions.
99
+ - **Timebox mapping:** If the landscape is moving faster than you can model it, run a probe (spike, canary, A/B) instead.
100
+
101
+ ## Maturity Modes
102
+
103
+ Match guardrails to maturity:
104
+
105
+ - **Exploratory:** Learning > durability. Prefer spikes; avoid irreversible state changes; manual verification is OK; expect throwaway code.
106
+ - **MVP:** Ship a thin end-to-end slice. Manual checks are OK, but you still need a fast rollback path and bounded impact.
107
+ - **Production:** Build to last. Automated tests, observability, progressive rollout, and explicit rollback/incident posture.
108
+
109
+ Expect limiting factors to move as you ship: fix the current bottleneck, then re-diagnose the next.
110
+
111
+ ## Progressive Delivery
112
+
113
+ - **Feature flags:** Use flags to make risky changes reversible. Categorize flags (release/experiment/ops/permissioning).
114
+ - **Flags are inventory:** Every flag needs an owner, an expiry, and a removal plan.
115
+ - **Canary/ramp when risk is non-trivial:** Start small, watch signals, ramp gradually; prefer "flip off" over redeploy.
116
+
117
+ ## Reliability Control Loop (If You Run Production)
118
+
119
+ - **SLO + error budget:** If you are within budget, keep shipping; if you burn budget, freeze non-critical changes and pay down reliability.
120
+
121
+ ## Avoid
122
+
123
+ - Big-bang releases, long-lived branches, unowned flags, flaky tests, and alert noise.
124
+
125
+ ## Python Guidelines
126
+
127
+ - Prefer type hints for public APIs; use `typing` / `collections.abc`.
128
+ - Use NumPy-style docstrings; keep them synced with type hints.
129
+ - Error handling: Use specific exceptions; avoid `try: ... except Exception: pass`.
130
+ - Dependencies: Use `uv add <package>`; do not manually edit `pyproject.toml`.
131
+
132
+ ## Docs Expectations
133
+
134
+ - Keep durable design/ops knowledge in `docs/` (architecture, runbook, decisions). Keep AGENTS.md as a short map, not an encyclopedia.
135
+
136
+ ## Testing Standards
137
+
138
+ - **Always use the project's package manager** to run tests. Never invoke test runners directly.
139
+ - Python (uv): `uv run pytest tests/ -v` (NEVER bare `pytest`)
140
+ - Python (poetry): `poetry run pytest tests/ -v`
141
+ - Node: `npm test` or `npm run test`
142
+ - Rust: `cargo test`
143
+ - **Rationale:** Bare `pytest` bypasses the virtualenv and may use the wrong Python/dependencies. Package managers ensure the correct environment. Bare invocations also trigger unnecessary permission prompts in automated workflows.
144
+
145
+ <!-- GUIDELINES-END -->
CLAUDE.md ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Project Map (AGENTS.md)
2
+
3
+ This file is a navigation map for agents. Durable knowledge lives in `docs/`.
4
+
5
+ ## Start Here
6
+
7
+ - Docs index: [docs/README.md](docs/README.md)
8
+ - Architecture: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
9
+ - Operations: [docs/RUNBOOK.md](docs/RUNBOOK.md)
10
+ - Validate: `opencode-ctx docs validate`
11
+ - Test: `uv run pytest tests/ -v`
12
+
13
+ ## System-of-Record Documents
14
+
15
+ | Category | Location | Type | Purpose |
16
+ |----------|----------|------|---------|
17
+ | Guides | [docs/guides/README.md](docs/guides/README.md) | how-to | Practical procedures |
18
+ | Design docs | [docs/design-docs/index.md](docs/design-docs/index.md) | explanation | Feature design, ADRs |
19
+ | Core beliefs | [docs/design-docs/core-beliefs.md](docs/design-docs/core-beliefs.md) | explanation | Agent-first principles |
20
+ | Learnings | [docs/learnings/README.md](docs/learnings/README.md) | reference | Durable patterns |
21
+ | Exec plans | [docs/exec-plans/README.md](docs/exec-plans/README.md) | how-to | Complex work tracking |
22
+ | Discovery | [docs/discovery/index.md](docs/discovery/index.md) | explanation | Validate + Taste |
23
+ | Delivery specs | [docs/delivery-specs/index.md](docs/delivery-specs/index.md) | reference | Engineering handoff |
24
+ | References | [docs/references/README.md](docs/references/README.md) | reference | External docs |
25
+ | Exploration | [docs/exploration/README.md](docs/exploration/README.md) | exploration | Ideas, scratchpad |
26
+ | Taxonomy | [docs/DOCS_TAXONOMY.md](docs/DOCS_TAXONOMY.md) | reference | Where to put new docs |
27
+ | Quality | [docs/QUALITY_SCORE.md](docs/QUALITY_SCORE.md) | reference | Domain grades |
28
+
29
+ ## Guardrails
30
+
31
+ - **Testing:** Use the package manager (`uv run pytest ...`), never bare `pytest`.
32
+ - **Skills:** Call `skill({ name: "<name>" })` first when asked to use a skill.
33
+ - **Config:** Project config in `opencode.jsonc` (repo root); `.opencode/` holds project agents/commands; global fallback in `~/.config/opencode/`.
34
+ - **Git safety:** No destructive commands (`reset --hard`, `push --force`) unless explicit.
35
+ - **Secrets:** Never commit `.env` or credentials.
36
+
37
+ ## Quick Commands
38
+
39
+ | Task | Command |
40
+ |------|---------|
41
+ | Install | `uv sync` |
42
+ | Docs validate | `opencode-ctx docs validate` |
43
+ | Arch snapshot | `opencode-ctx docs architecture apply` |
44
+ | Lint | `uv run ruff check --fix .` |
45
+ | Format | `uv run ruff format .` |
46
+ | Test | `uv run pytest tests/ -v` |
47
+ | Run | `uv run python -m <module>` |
48
+
49
+ ## Development Workflow
50
+
51
+ - Run via package manager (`uv run ...`), never bare commands.
52
+ - List existing files before creating new ones (avoid naming drift).
53
+ - Prefer vertical slices over horizontal refactors.
54
+ - No premature abstraction until multiple use-cases require it.
55
+
56
+ <!-- GUIDELINES-BEGIN -->
57
+
58
+ <!-- Managed by: opencode-ctx guidelines apply --packs python,testing,delivery-safety -->
59
+ <!-- Run the command above to populate this section -->
60
+
61
+ <!-- GUIDELINES-END -->
Dockerfile ADDED
@@ -0,0 +1,85 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Multi-stage build using openenv-base
2
+ # Works for both in-repo and standalone environments.
3
+ # The build script (openenv build) handles context detection.
4
+
5
+ ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
6
+ FROM ${BASE_IMAGE} AS builder
7
+
8
+ WORKDIR /app
9
+
10
+ # Ensure git is available (required for VCS dependencies)
11
+ RUN apt-get update && \
12
+ apt-get install -y --no-install-recommends git && \
13
+ rm -rf /var/lib/apt/lists/*
14
+
15
+ ARG BUILD_MODE=in-repo
16
+ ARG ENV_NAME=sql_env
17
+ # Set to https://download.pytorch.org/whl/cpu for CPU-only (default, smaller image)
18
+ # Set to "" for full CUDA support (GPU deployment)
19
+ ARG TORCH_INDEX=https://download.pytorch.org/whl/cpu
20
+
21
+ # Copy environment code
22
+ COPY . /app/env
23
+
24
+ WORKDIR /app/env
25
+
26
+ # Ensure uv is available
27
+ RUN if ! command -v uv >/dev/null 2>&1; then \
28
+ curl -LsSf https://astral.sh/uv/install.sh | sh && \
29
+ mv /root/.local/bin/uv /usr/local/bin/uv && \
30
+ mv /root/.local/bin/uvx /usr/local/bin/uvx; \
31
+ fi
32
+
33
+ # Install dependencies (TORCH_INDEX controls CPU vs CUDA PyTorch)
34
+ RUN --mount=type=cache,target=/root/.cache/uv \
35
+ export UV_PROJECT_ENVIRONMENT=/app/.venv && \
36
+ if [ -n "${TORCH_INDEX}" ]; then export UV_EXTRA_INDEX_URL="${TORCH_INDEX}"; fi && \
37
+ if [ -f uv.lock ]; then \
38
+ uv sync --frozen --no-install-project --no-editable; \
39
+ else \
40
+ uv sync --no-install-project --no-editable; \
41
+ fi
42
+
43
+ RUN --mount=type=cache,target=/root/.cache/uv \
44
+ export UV_PROJECT_ENVIRONMENT=/app/.venv && \
45
+ if [ -n "${TORCH_INDEX}" ]; then export UV_EXTRA_INDEX_URL="${TORCH_INDEX}"; fi && \
46
+ if [ -f uv.lock ]; then \
47
+ uv sync --frozen --no-editable; \
48
+ else \
49
+ uv sync --no-editable; \
50
+ fi
51
+
52
+ # Final runtime stage
53
+ FROM ${BASE_IMAGE}
54
+
55
+ WORKDIR /app
56
+
57
+ # Default port (HF Spaces overrides with PORT=7860)
58
+ ENV PORT=8000
59
+
60
+ # Copy the virtual environment from builder
61
+ COPY --from=builder /app/.venv /app/.venv
62
+
63
+ # Copy the environment code
64
+ COPY --from=builder /app/env /app/env
65
+
66
+ # Explicitly copy bundled Spider databases for deployment checks
67
+ COPY --from=builder /app/env/data/databases /app/env/data/databases
68
+
69
+ # Set PATH to use the virtual environment
70
+ ENV PATH="/app/.venv/bin:$PATH"
71
+
72
+ # Set PYTHONPATH so imports work correctly
73
+ ENV PYTHONPATH="/app/env:$PYTHONPATH"
74
+
75
+ # Run as non-root for HF Spaces security best practice
76
+ RUN useradd --create-home --uid 10001 appuser
77
+ USER appuser
78
+
79
+ # Health check verifies bundled DBs and API health
80
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
81
+ CMD sh -c 'find /app/env/data/databases -name "*.sqlite" -print -quit | grep -q . && curl -f "http://localhost:${PORT:-8000}/health"' || exit 1
82
+
83
+ # Run the FastAPI server
84
+ ENV ENABLE_WEB_INTERFACE=true
85
+ CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port ${PORT:-8000}"]
GEMINI.md ADDED
@@ -0,0 +1,62 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Project Map (AGENTS.md)
2
+
3
+ This file is a navigation map for agents. Durable knowledge lives in `docs/`.
4
+
5
+ ## Start Here
6
+
7
+ - Docs index: [docs/README.md](docs/README.md)
8
+ - Architecture: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
9
+ - Operations: [docs/RUNBOOK.md](docs/RUNBOOK.md)
10
+ - Validate: `opencode-ctx docs validate`
11
+ - Test: `uv run pytest tests/ -v`
12
+
13
+ ## System-of-Record Documents
14
+
15
+ | Category | Location | Type | Purpose |
16
+ |----------|----------|------|---------|
17
+ | Guides | [docs/guides/README.md](docs/guides/README.md) | how-to | Practical procedures |
18
+ | Design docs | [docs/design-docs/index.md](docs/design-docs/index.md) | explanation | Feature design, ADRs |
19
+ | Core beliefs | [docs/design-docs/core-beliefs.md](docs/design-docs/core-beliefs.md) | explanation | Agent-first principles |
20
+ | Learnings | [docs/learnings/README.md](docs/learnings/README.md) | reference | Durable patterns |
21
+ | Exec plans | [docs/exec-plans/README.md](docs/exec-plans/README.md) | how-to | Complex work tracking |
22
+ | Discovery | [docs/discovery/index.md](docs/discovery/index.md) | explanation | Validate + Taste |
23
+ | Delivery specs | [docs/delivery-specs/index.md](docs/delivery-specs/index.md) | reference | Engineering handoff |
24
+ | References | [docs/references/README.md](docs/references/README.md) | reference | External docs |
25
+ | Exploration | [docs/exploration/README.md](docs/exploration/README.md) | exploration | Ideas, scratchpad |
26
+ | Taxonomy | [docs/DOCS_TAXONOMY.md](docs/DOCS_TAXONOMY.md) | reference | Where to put new docs |
27
+ | Quality | [docs/QUALITY_SCORE.md](docs/QUALITY_SCORE.md) | reference | Domain grades |
28
+
29
+ ## Guardrails
30
+
31
+ - **Testing:** Use the package manager (`uv run pytest ...`), never bare `pytest`.
32
+ - **Skills:** Call `skill({ name: "<name>" })` first when asked to use a skill.
33
+ - **Config:** Project config in `opencode.jsonc` (repo root); `.opencode/` holds project agents/commands; global fallback in `~/.config/opencode/`.
34
+ - **Git safety:** No destructive commands (`reset --hard`, `push --force`) unless explicit.
35
+ - **Secrets:** Never commit `.env` or credentials.
36
+
37
+ ## Quick Commands
38
+
39
+ | Task | Command |
40
+ |------|---------|
41
+ | Install | `uv sync` |
42
+ | Init project | `opencode-ctx docs init` (scaffolds docs, config, git hooks) |
43
+ | Docs validate | `opencode-ctx docs validate` |
44
+ | Arch snapshot | `opencode-ctx docs architecture apply` |
45
+ | Lint | `uv run ruff check --fix .` |
46
+ | Format | `uv run ruff format .` |
47
+ | Test | `uv run pytest tests/ -v` |
48
+ | Run | `uv run python -m <module>` |
49
+
50
+ ## Development Workflow
51
+
52
+ - Run via package manager (`uv run ...`), never bare commands.
53
+ - List existing files before creating new ones (avoid naming drift).
54
+ - Prefer vertical slices over horizontal refactors.
55
+ - No premature abstraction until multiple use-cases require it.
56
+
57
+ <!-- GUIDELINES-BEGIN -->
58
+
59
+ <!-- Managed by: opencode-ctx guidelines apply --packs python,testing,delivery-safety -->
60
+ <!-- Run the command above to populate this section -->
61
+
62
+ <!-- GUIDELINES-END -->
README.md CHANGED
@@ -1,10 +1,140 @@
1
  ---
2
- title: Sql Env
3
- emoji: 🌍
4
- colorFrom: pink
5
- colorTo: red
6
  sdk: docker
7
  pinned: false
 
8
  ---
9
 
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ title: SQLEnv
3
+ emoji: 🤖
4
+ colorFrom: blue
5
+ colorTo: green
6
  sdk: docker
7
  pinned: false
8
+ base_path: /web
9
  ---
10
 
11
+ # SQLEnv: Teaching Agents to Explore Databases
12
+
13
+ ![Python](https://img.shields.io/badge/python-3.12-blue.svg)
14
+ ![License](https://img.shields.io/badge/license-MIT-green.svg)
15
+
16
+ SQLEnv is an interactive RL environment for text-to-SQL reasoning. Instead of producing one-shot SQL, agents learn to think like data analysts: inspect schema, sample rows, run exploratory queries, and submit a final answer with confidence.
17
+
18
+ Built for the [OpenEnv Challenge](https://github.com/meta-pytorch/OpenEnv), this project packages the environment runtime, dense rewards, evaluation, and training hooks so others can reproduce results and iterate quickly.
19
+
20
+ ## Quick Start
21
+
22
+ Run these three commands to install, validate, and smoke-test the environment:
23
+
24
+ ```bash
25
+ uv sync
26
+ uv run openenv validate --verbose
27
+ uv run pytest tests/ -v
28
+ ```
29
+
30
+ Local server run:
31
+
32
+ ```bash
33
+ uv run uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
34
+ ```
35
+
36
+ Docker run:
37
+
38
+ ```bash
39
+ docker build -t sql-env:latest -f server/Dockerfile .
40
+ docker run -p 8000:8000 sql-env:latest
41
+ ```
42
+
43
+ ## Why SQLEnv
44
+
45
+ Static text-to-SQL benchmarks reward final outputs, not reasoning quality. SQLEnv turns SQL generation into an interactive decision process with feedback at each step, making it suitable for RL training and behavior analysis.
46
+
47
+ ## Architecture
48
+
49
+ ```text
50
+ +-------------+ WebSocket +----------------------+ SQLite
51
+ | RL Agent | <------------------> | SQLEnvClient | <----------------+
52
+ | (GRPO/TRL) | | (client.py) | |
53
+ +-------------+ +----------+-----------+ |
54
+ HTTP/WebSocket |
55
+ | |
56
+ v |
57
+ +--------------------------+ |
58
+ | FastAPI Server | |
59
+ | (server.app:app) | |
60
+ +------------+-------------+ |
61
+ | |
62
+ v |
63
+ +--------------------------+ |
64
+ | SQLEnvironment |------------+
65
+ | step/reset/reward/verify |
66
+ +--------------------------+
67
+ ```
68
+
69
+ ## How It Works
70
+
71
+ Each episode begins with a natural language question mapped to a hidden Spider database. The agent acts through four environment actions:
72
+
73
+ | Action | Purpose | Typical Output |
74
+ |--------|---------|----------------|
75
+ | `DESCRIBE table_name` | Inspect schema and column metadata | Column names, types, row count |
76
+ | `SAMPLE table_name` | Inspect representative rows | Small row sample |
77
+ | `QUERY sql_string` | Execute read-only SQL in sandbox | Query result rows or SQL error |
78
+ | `ANSWER value` | Submit final answer | Terminal reward and completion |
79
+
80
+ Episode flow:
81
+ 1. `reset()` returns question context and available tables.
82
+ 2. `step()` executes one exploration action at a time.
83
+ 3. `ANSWER` ends the episode with correctness-based terminal reward.
84
+
85
+ ## Train an Agent
86
+
87
+ Use the GRPO training pipeline artifacts from F006 and run the notebook workflow:
88
+
89
+ - Notebook: `notebooks/train_grpo.ipynb`
90
+ - Training support modules: `training/`
91
+ - Evaluation utilities: `evaluation/`
92
+
93
+ This setup is designed for Colab and local CPU/GPU environments.
94
+
95
+ ## HuggingFace Space
96
+
97
+ - Live Space: `https://huggingface.co/spaces/<your-org-or-user>/sql-env` (update after push)
98
+ - Health check: `curl https://<space-url>/health`
99
+ - Deploy command: `uv run openenv push`
100
+
101
+ ## Project Structure
102
+
103
+ ```text
104
+ sql-env/
105
+ |- __init__.py
106
+ |- client.py
107
+ |- models.py
108
+ |- openenv.yaml
109
+ |- server/
110
+ | |- app.py
111
+ | |- sql_environment.py
112
+ | |- reward.py
113
+ | |- verifier.py
114
+ | `- Dockerfile
115
+ |- data/
116
+ | |- databases/
117
+ | `- questions/
118
+ |- training/
119
+ |- evaluation/
120
+ |- notebooks/
121
+ | `- train_grpo.ipynb
122
+ |- specs/
123
+ |- docs/
124
+ `- tests/
125
+ ```
126
+
127
+ ## Deployment Checklist
128
+
129
+ 1. `uv run openenv validate --verbose`
130
+ 2. `uv run openenv build`
131
+ 3. `uv run openenv push`
132
+ 4. Verify `/health` and run one full episode through the client.
133
+
134
+ ## Links
135
+
136
+ - OpenEnv framework: https://github.com/meta-pytorch/OpenEnv
137
+ - OpenEnv docs: https://meta-pytorch.org/OpenEnv/
138
+ - Spider dataset: https://huggingface.co/datasets/xlangai/spider
139
+ - TRL OpenEnv docs: https://huggingface.co/docs/trl/openenv
140
+ - Verification plan: `specs/F007-VERIFICATION_SPEC.md`
REVIEW_REPORT.md ADDED
@@ -0,0 +1,57 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Code Review Report: F006 Step 3.1 (`notebooks/train_grpo.ipynb`, `pyproject.toml`, `tests/e2e/test_training_e2e.py`)
2
+
3
+ **Risk Tier:** Medium
4
+ **Status:** Failed
5
+ **Verdict:** BLOCK
6
+
7
+ ## Summary
8
+
9
+ Step 3.1 is not ready to merge. The training extra currently resolves to a TRL version incompatible with the repo’s pinned Torch version, causing notebook imports to fail before training can start. In addition, the added E2E test only validates notebook structure and does not exercise the required one-step training smoke flow from the verification spec.
10
+
11
+ ## Evidence
12
+
13
+ ### Tests
14
+ - **Status:** Passed (limited scope)
15
+ - **Command:** `uv run --with pytest pytest tests/e2e/test_training_e2e.py -v`
16
+ - **Results:** `2 passed, 0 failed`
17
+
18
+ ### Dependency/Runtime Validation
19
+ - **Status:** Failed
20
+ - **Command:** `uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"`
21
+ - **Observed:** Import error (`cannot import name 'FSDPModule'`) in TRL with current Torch pin.
22
+
23
+ ### Security (Medium)
24
+ - **Status:** Clear
25
+ - **Checks:** Medium-tier quick checks only (no secrets/auth/unsafe execution patterns introduced in scoped changes).
26
+
27
+ ## Issues
28
+
29
+ ### Critical
30
+ 1. **Training extra resolves to incompatible TRL, breaking notebook startup**
31
+ - **Location:** `pyproject.toml:30-33`, `notebooks/train_grpo.ipynb:29-35`
32
+ - **Problem:** `training = ["trl>=0.12.0", "accelerate>=0.34.0"]` permits latest TRL (installed as 0.29.1), which fails to import with pinned `torch==2.2.2`.
33
+ - **Impact:** Notebook cannot run end-to-end (“one click” success criterion fails before training).
34
+ - **Fix:** Pin a TRL range compatible with Torch 2.2.2 (or upgrade Torch accordingly), then add/import-check coverage in tests.
35
+
36
+ ### Important
37
+ 1. **E2E smoke test does not validate actual Step 3.1 execution path**
38
+ - **Location:** `tests/e2e/test_training_e2e.py:25-65`
39
+ - **Problem:** Test checks notebook text structure and helper filtering only; it does not instantiate trainer, run `trainer.train()`, or verify metrics/comparison outputs as specified.
40
+ - **Impact:** Regressions in training flow can pass CI undetected.
41
+ - **Fix:** Add a true smoke execution test (tiny/mocked model + single train step + metric assertion), aligned to `specs/F006-VERIFICATION_SPEC.md` Section 4.
42
+
43
+ 2. **Comparison cell is not random-vs-trained and does not capture pre-training baseline**
44
+ - **Location:** `notebooks/train_grpo.ipynb:181-183`
45
+ - **Problem:** Both `before_rollouts` and `after_rollouts` use `rollout_func` with the same model after training.
46
+ - **Impact:** Fails the feature’s “before vs after” demo intent (and spec’s random-vs-trained comparison).
47
+ - **Fix:** Capture baseline episodes before training (or explicit random policy), then run trained-policy episodes after `trainer.train()`.
48
+
49
+ ### Minor
50
+ None.
51
+
52
+ ## Next Actions
53
+
54
+ 1. Fix dependency compatibility (TRL/Torch) and prove imports succeed in clean env.
55
+ 2. Upgrade E2E smoke test to execute one real/mocked GRPO training step and assert logged metrics.
56
+ 3. Correct notebook comparison to true baseline-vs-trained behavior.
57
+ 4. Re-run: `uv run --with pytest pytest tests/e2e/test_training_e2e.py -v` and include import-check evidence.
__init__.py ADDED
@@ -0,0 +1,36 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """SQLEnv: Interactive Database Query Environment for the OpenEnv Challenge."""
2
+
3
+ # ---------------------------------------------------------------------------
4
+ # Pydantic / TypedDict compatibility shim
5
+ # ---------------------------------------------------------------------------
6
+ # The openenv library defines ``Message`` with ``typing.TypedDict``.
7
+ # On Python < 3.12, Pydantic 2.x rejects ``typing.TypedDict`` in model
8
+ # fields; it requires ``typing_extensions.TypedDict`` instead. We patch
9
+ # ``typing.TypedDict`` early so that all downstream imports see the
10
+ # compatible version before any Pydantic model is constructed.
11
+ import sys
12
+
13
+ if sys.version_info < (3, 12):
14
+ import typing
15
+ import typing_extensions
16
+
17
+ typing.TypedDict = typing_extensions.TypedDict # type: ignore[attr-defined]
18
+
19
+ try:
20
+ from .models import SQLAction, SQLObservation, SQLState
21
+ except ImportError:
22
+ # When pytest imports this file standalone (not as part of the sql_env
23
+ # package), relative imports fail. Fall back to absolute imports.
24
+ try:
25
+ from sql_env.models import SQLAction, SQLObservation, SQLState # type: ignore[no-redef]
26
+ except ImportError:
27
+ pass # Imports not available; this file is being collected, not used.
28
+
29
+ # Client is not imported at package level to avoid loading torch unnecessarily.
30
+ # Import it explicitly when needed: from sql_env.client import SQLEnvClient
31
+
32
+ __all__ = [
33
+ "SQLAction",
34
+ "SQLObservation",
35
+ "SQLState",
36
+ ]
client.py ADDED
@@ -0,0 +1,140 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import Any, Dict, Iterable
2
+
3
+ import torch
4
+ from openenv.core.client_types import StepResult
5
+
6
+ from openenv.core.env_server.interfaces import Message
7
+ from openenv.core.env_client import EnvClient
8
+
9
+ from .models import SQLAction, SQLObservation, SQLState
10
+
11
+
12
class SQLEnvClient(EnvClient[SQLAction, SQLObservation, SQLState]):
    """Client for interacting with the SQLEnv environment server.

    Translates between chat-style ``Message`` objects / server payloads and
    the typed ``SQLAction`` / ``SQLObservation`` / ``SQLState`` models.
    """

    def _step_payload(self, action: SQLAction) -> Dict[str, Any]:
        """Convert a SQLAction into the payload for the step endpoint."""
        return {
            "action_type": action.action_type,
            "argument": action.argument,
            "metadata": action.metadata,
        }

    def _parse_result(self, payload: Dict[str, Any]) -> StepResult[SQLObservation]:
        """Parse the response from the step endpoint into a StepResult."""

        # The observation may be nested under "observation" or the payload may
        # be flat; fall back to the payload itself when no nested dict exists.
        obs_data = payload.get("observation")
        if not isinstance(obs_data, dict):
            obs_data = payload

        # Prefer top-level done/reward keys; fall back to the observation dict.
        done = payload.get("done", obs_data.get("done", False))
        reward = payload.get("reward", obs_data.get("reward"))

        # Coerce each field defensively so a missing or mistyped key cannot
        # raise during deserialization.
        observation = SQLObservation(
            question=str(obs_data.get("question", "")),
            schema_info=str(obs_data.get("schema_info", "")),
            result=str(obs_data.get("result", "")),
            error=str(obs_data.get("error", "")),
            step_count=int(obs_data.get("step_count", 0)),
            budget_remaining=int(obs_data.get("budget_remaining", 0)),
            action_history=list(obs_data.get("action_history", [])),
            done=bool(done),
            reward=reward,
            metadata=obs_data.get("metadata", {}),
        )

        return StepResult(
            observation=observation,
            reward=reward,
            done=bool(done),
        )

    def _parse_state(self, payload: Dict[str, Any]) -> SQLState:
        """Parse a state payload into a SQLState, rebuilding token tensors."""
        # Parse history messages
        history_messages = payload.get("history_messages", [])

        # Parse history tokens - convert lists back to tensors
        # (tokens travel over the wire as plain lists of ints).
        history_tokens_data = payload.get("history_tokens", [])
        history_tokens = []
        for token_list in history_tokens_data:
            if token_list:
                history_tokens.append(torch.tensor(token_list))
            else:
                # Preserve position with an empty tensor for empty entries.
                history_tokens.append(torch.tensor([]))

        return SQLState(
            episode_id=payload.get("episode_id"),
            step_count=payload.get("step_count", 0),
            history_messages=history_messages,
            history_tokens=history_tokens,
            current_action_type=payload.get("current_action_type", "query"),
        )

    def _detect_action_type(self, message_content: str) -> str:
        """Detect the action type from user message content.

        Heuristic keyword matching; check order matters: an explicit
        ``answer `` prefix wins, then DESCRIBE keywords, then SAMPLE
        keywords, and anything else defaults to QUERY.
        """
        content_lower = message_content.lower()

        if content_lower.startswith("answer "):
            return "ANSWER"

        describe_keywords = [
            "describe",
            "schema",
            "columns",
            "structure",
            "what columns",
            "show columns",
        ]
        if any(keyword in content_lower for keyword in describe_keywords):
            return "DESCRIBE"

        sample_keywords = [
            "sample",
            "example",
            "rows",
            "data",
            "show me",
            "few rows",
            "how many",
        ]
        if any(keyword in content_lower for keyword in sample_keywords):
            return "SAMPLE"

        return "QUERY"

    def message_to_action(
        self,
        message: Message,
        tokenizer: Any,
        history_messages: Iterable[Message] | None = None,
    ) -> SQLAction:
        """Convert a user Message into a SQLAction.

        An explicit leading verb (DESCRIBE/SAMPLE/QUERY/ANSWER, any case)
        takes precedence; otherwise the keyword heuristic in
        ``_detect_action_type`` decides. Non-user or empty messages fall
        through as a QUERY with the raw content as argument.

        Raises
        ------
        ValueError
            If the message lacks a 'role' or 'content' key, or content is None.
        """
        if "role" not in message:
            raise ValueError("Message must contain a 'role' key")
        if "content" not in message:
            raise ValueError("Message must contain a 'content' key")
        if message["content"] is None:
            raise ValueError("Message content cannot be None")

        # Accepted for interface parity with EnvClient; unused here.
        _ = tokenizer
        _ = history_messages

        content = str(message["content"])
        parsed = content.strip()

        action_type = "QUERY"
        argument = content
        if message["role"].lower() == "user" and parsed:
            prefix, separator, remainder = parsed.partition(" ")
            normalized_prefix = prefix.upper()
            if normalized_prefix in {"DESCRIBE", "SAMPLE", "QUERY", "ANSWER"}:
                # Explicit verb: argument is everything after the verb
                # (empty when the message is the bare verb with no space).
                action_type = normalized_prefix
                argument = remainder if separator else ""
            else:
                action_type = self._detect_action_type(parsed)
                argument = parsed

        return SQLAction(
            action_type=action_type,
            argument=argument,
        )
conftest.py ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ """Pytest configuration — exclude package __init__.py from collection."""
2
+
3
+ collect_ignore = ["__init__.py"]
data/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """SQLEnv data package — databases and question sets."""
data/databases/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """SQLAlchemy ORM models for SQLEnv databases."""
data/databases/models.py ADDED
@@ -0,0 +1,153 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ SQLAlchemy ORM models for the university course management database.
3
+
4
+ This module defines all tables using SQLAlchemy declarative syntax with proper
5
+ relationships and data types.
6
+ """
7
+
8
+ from datetime import datetime
9
+ from sqlalchemy import Column, Integer, String, DateTime, ForeignKey
10
+ from sqlalchemy.orm import declarative_base, relationship
11
+
12
# Single declarative base shared by every ORM model defined in this module.
Base = declarative_base()
13
+
14
+
15
class Address(Base):
    """Address information for people.

    Linked to :class:`Person` many-to-many through the People_Addresses
    association table (see :class:`PersonAddress`).
    """

    __tablename__ = "Addresses"

    # Surrogate primary key.
    address_id = Column(Integer, primary_key=True, autoincrement=True)
    # First address line is the only mandatory component.
    line_1 = Column(String(255), nullable=False)
    line_2 = Column(String(255))
    city = Column(String(100))
    zip_postcode = Column(String(20))
    state_province_county = Column(String(100))
    country = Column(String(100))

    # Relationships
    # One address can be associated with many people over time.
    people_addresses = relationship("PersonAddress", back_populates="address")
30
+
31
+
32
class Person(Base):
    """Person information.

    Linked to :class:`Address` many-to-many through the People_Addresses
    association table (see :class:`PersonAddress`).
    """

    __tablename__ = "People"

    # Surrogate primary key.
    person_id = Column(Integer, primary_key=True, autoincrement=True)
    first_name = Column(String(100), nullable=False)
    middle_name = Column(String(100))
    last_name = Column(String(100), nullable=False)
    cell_mobile_number = Column(String(20))
    email_address = Column(String(255))
    # Login name must be unique across all people.
    login_name = Column(String(100), unique=True)
    # NOTE(review): stored as a plain string column — presumably mirrors the
    # source dataset schema rather than a real auth store; confirm before
    # storing any real credentials here.
    password = Column(String(255))

    # Relationships
    people_addresses = relationship("PersonAddress", back_populates="person")
48
+
49
+
50
class Student(Base):
    """Student information.

    Registrations and attendance are modeled as separate association
    tables, so a student can be registered without ever attending.
    """

    __tablename__ = "Students"

    # Surrogate primary key.
    student_id = Column(Integer, primary_key=True, autoincrement=True)
    # Free-form details blob; no structured columns in this schema.
    student_details = Column(String(500))

    # Relationships
    course_registrations = relationship(
        "StudentCourseRegistration", back_populates="student"
    )
    course_attendance = relationship(
        "StudentCourseAttendance", back_populates="student"
    )
65
+
66
+
67
class Course(Base):
    """Course information."""

    __tablename__ = "Courses"

    # Natural (string) primary key — course ids are codes, not integers.
    course_id = Column(String(50), primary_key=True)
    course_name = Column(String(200), nullable=False)
    course_description = Column(String(500))
    other_details = Column(String(500))

    # Relationships
    course_registrations = relationship(
        "StudentCourseRegistration", back_populates="course"
    )
    course_attendance = relationship("StudentCourseAttendance", back_populates="course")
82
+
83
+
84
class PersonAddress(Base):
    """Link between people and their addresses with date ranges.

    Association object for the People ⟷ Addresses many-to-many; the
    date_from/date_to pair records when each address was in effect.
    """

    __tablename__ = "People_Addresses"

    # Surrogate key; the (person_id, address_id) pair is NOT declared
    # unique, so repeated stints at the same address are representable.
    person_address_id = Column(Integer, primary_key=True, autoincrement=True)
    person_id = Column(Integer, ForeignKey("People.person_id"), nullable=False)
    address_id = Column(Integer, ForeignKey("Addresses.address_id"), nullable=False)
    date_from = Column(DateTime)
    date_to = Column(DateTime)

    # Relationships
    person = relationship("Person", back_populates="people_addresses")
    address = relationship("Address", back_populates="people_addresses")
98
+
99
+
100
class StudentCourseRegistration(Base):
    """Student registration for courses.

    Composite primary key (student_id, course_id): at most one
    registration per student per course.
    """

    __tablename__ = "Student_Course_Registrations"

    student_id = Column(Integer, ForeignKey("Students.student_id"), primary_key=True)
    course_id = Column(String(50), ForeignKey("Courses.course_id"), primary_key=True)
    # NOTE(review): `datetime.utcnow` is deprecated since Python 3.12 and
    # produces a naive datetime; consider a timezone-aware default — but
    # changing it alters stored values, so confirm with consumers first.
    registration_date = Column(DateTime, default=datetime.utcnow)

    # Relationships
    student = relationship("Student", back_populates="course_registrations")
    course = relationship("Course", back_populates="course_registrations")
112
+
113
+
114
class StudentCourseAttendance(Base):
    """Student attendance records for courses.

    Composite primary key includes the attendance date, so one row per
    (student, course, date) — multiple attendances per course are allowed.
    """

    __tablename__ = "Student_Course_Attendance"

    student_id = Column(Integer, ForeignKey("Students.student_id"), primary_key=True)
    course_id = Column(String(50), ForeignKey("Courses.course_id"), primary_key=True)
    date_of_attendance = Column(DateTime, primary_key=True)

    # Relationships
    student = relationship("Student", back_populates="course_attendance")
    course = relationship("Course", back_populates="course_attendance")
126
+
127
+
128
class Candidate(Base):
    """Candidate information."""

    __tablename__ = "Candidates"

    # Surrogate primary key.
    candidate_id = Column(Integer, primary_key=True, autoincrement=True)
    # Free-form details blob, mirroring Students.student_details.
    candidate_details = Column(String(500))

    # Relationships
    assessments = relationship("CandidateAssessment", back_populates="candidate")
138
+
139
+
140
class CandidateAssessment(Base):
    """Assessment records for candidates.

    Composite primary key (candidate_id, qualification, assessment_date):
    a candidate may be assessed for the same qualification on different
    dates.
    """

    __tablename__ = "Candidate_Assessments"

    candidate_id = Column(
        Integer, ForeignKey("Candidates.candidate_id"), primary_key=True
    )
    qualification = Column(String(200), primary_key=True)
    assessment_date = Column(DateTime, primary_key=True)
    # NOTE(review): "asessment" spelling appears to match the upstream
    # dataset's column name — do not rename without migrating the data;
    # confirm against the source database schema.
    asessment_outcome_code = Column(String(50))

    # Relationships
    candidate = relationship("Candidate", back_populates="assessments")
data/questions/db_list.json ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ "student_assessment",
3
+ "concert_singer",
4
+ "world_1",
5
+ "car_1",
6
+ "employee_hire_evaluation",
7
+ "pets_1",
8
+ "cre_Doc_Template_Mgt",
9
+ "dog_kennels",
10
+ "flight_2",
11
+ "poker_player"
12
+ ]
data/questions/questions_eval.json ADDED
The diff for this file is too large to render. See raw diff
 
data/questions/questions_train.json ADDED
The diff for this file is too large to render. See raw diff
 
data/questions/student_assessment.json ADDED
@@ -0,0 +1,3355 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ {
3
+ "db_id": "student_assessment",
4
+ "query": "SELECT T1.course_name FROM courses AS T1 JOIN student_course_registrations AS T2 ON T1.course_id = T2.course_Id GROUP BY T1.course_id ORDER BY count(*) DESC LIMIT 1",
5
+ "question": "which course has most number of registered students?",
6
+ "query_toks": [
7
+ "SELECT",
8
+ "T1.course_name",
9
+ "FROM",
10
+ "courses",
11
+ "AS",
12
+ "T1",
13
+ "JOIN",
14
+ "student_course_registrations",
15
+ "AS",
16
+ "T2",
17
+ "ON",
18
+ "T1.course_id",
19
+ "=",
20
+ "T2.course_Id",
21
+ "GROUP",
22
+ "BY",
23
+ "T1.course_id",
24
+ "ORDER",
25
+ "BY",
26
+ "count",
27
+ "(",
28
+ "*",
29
+ ")",
30
+ "DESC",
31
+ "LIMIT",
32
+ "1"
33
+ ],
34
+ "query_toks_no_value": [
35
+ "select",
36
+ "t1",
37
+ ".",
38
+ "course_name",
39
+ "from",
40
+ "courses",
41
+ "as",
42
+ "t1",
43
+ "join",
44
+ "student_course_registrations",
45
+ "as",
46
+ "t2",
47
+ "on",
48
+ "t1",
49
+ ".",
50
+ "course_id",
51
+ "=",
52
+ "t2",
53
+ ".",
54
+ "course_id",
55
+ "group",
56
+ "by",
57
+ "t1",
58
+ ".",
59
+ "course_id",
60
+ "order",
61
+ "by",
62
+ "count",
63
+ "(",
64
+ "*",
65
+ ")",
66
+ "desc",
67
+ "limit",
68
+ "value"
69
+ ],
70
+ "question_toks": [
71
+ "which",
72
+ "course",
73
+ "has",
74
+ "most",
75
+ "number",
76
+ "of",
77
+ "registered",
78
+ "students",
79
+ "?"
80
+ ]
81
+ },
82
+ {
83
+ "db_id": "student_assessment",
84
+ "query": "SELECT T1.course_name FROM courses AS T1 JOIN student_course_registrations AS T2 ON T1.course_id = T2.course_Id GROUP BY T1.course_id ORDER BY count(*) DESC LIMIT 1",
85
+ "question": "What is the name of the course with the most registered students?",
86
+ "query_toks": [
87
+ "SELECT",
88
+ "T1.course_name",
89
+ "FROM",
90
+ "courses",
91
+ "AS",
92
+ "T1",
93
+ "JOIN",
94
+ "student_course_registrations",
95
+ "AS",
96
+ "T2",
97
+ "ON",
98
+ "T1.course_id",
99
+ "=",
100
+ "T2.course_Id",
101
+ "GROUP",
102
+ "BY",
103
+ "T1.course_id",
104
+ "ORDER",
105
+ "BY",
106
+ "count",
107
+ "(",
108
+ "*",
109
+ ")",
110
+ "DESC",
111
+ "LIMIT",
112
+ "1"
113
+ ],
114
+ "query_toks_no_value": [
115
+ "select",
116
+ "t1",
117
+ ".",
118
+ "course_name",
119
+ "from",
120
+ "courses",
121
+ "as",
122
+ "t1",
123
+ "join",
124
+ "student_course_registrations",
125
+ "as",
126
+ "t2",
127
+ "on",
128
+ "t1",
129
+ ".",
130
+ "course_id",
131
+ "=",
132
+ "t2",
133
+ ".",
134
+ "course_id",
135
+ "group",
136
+ "by",
137
+ "t1",
138
+ ".",
139
+ "course_id",
140
+ "order",
141
+ "by",
142
+ "count",
143
+ "(",
144
+ "*",
145
+ ")",
146
+ "desc",
147
+ "limit",
148
+ "value"
149
+ ],
150
+ "question_toks": [
151
+ "What",
152
+ "is",
153
+ "the",
154
+ "name",
155
+ "of",
156
+ "the",
157
+ "course",
158
+ "with",
159
+ "the",
160
+ "most",
161
+ "registered",
162
+ "students",
163
+ "?"
164
+ ]
165
+ },
166
+ {
167
+ "db_id": "student_assessment",
168
+ "query": "SELECT student_id FROM student_course_registrations GROUP BY student_id ORDER BY count(*) LIMIT 1",
169
+ "question": "what is id of students who registered some courses but the least number of courses in these students?",
170
+ "query_toks": [
171
+ "SELECT",
172
+ "student_id",
173
+ "FROM",
174
+ "student_course_registrations",
175
+ "GROUP",
176
+ "BY",
177
+ "student_id",
178
+ "ORDER",
179
+ "BY",
180
+ "count",
181
+ "(",
182
+ "*",
183
+ ")",
184
+ "LIMIT",
185
+ "1"
186
+ ],
187
+ "query_toks_no_value": [
188
+ "select",
189
+ "student_id",
190
+ "from",
191
+ "student_course_registrations",
192
+ "group",
193
+ "by",
194
+ "student_id",
195
+ "order",
196
+ "by",
197
+ "count",
198
+ "(",
199
+ "*",
200
+ ")",
201
+ "limit",
202
+ "value"
203
+ ],
204
+ "question_toks": [
205
+ "what",
206
+ "is",
207
+ "id",
208
+ "of",
209
+ "students",
210
+ "who",
211
+ "registered",
212
+ "some",
213
+ "courses",
214
+ "but",
215
+ "the",
216
+ "least",
217
+ "number",
218
+ "of",
219
+ "courses",
220
+ "in",
221
+ "these",
222
+ "students",
223
+ "?"
224
+ ]
225
+ },
226
+ {
227
+ "db_id": "student_assessment",
228
+ "query": "SELECT student_id FROM student_course_registrations GROUP BY student_id ORDER BY count(*) LIMIT 1",
229
+ "question": "What are the ids of the students who registered for some courses but had the least number of courses for all students?",
230
+ "query_toks": [
231
+ "SELECT",
232
+ "student_id",
233
+ "FROM",
234
+ "student_course_registrations",
235
+ "GROUP",
236
+ "BY",
237
+ "student_id",
238
+ "ORDER",
239
+ "BY",
240
+ "count",
241
+ "(",
242
+ "*",
243
+ ")",
244
+ "LIMIT",
245
+ "1"
246
+ ],
247
+ "query_toks_no_value": [
248
+ "select",
249
+ "student_id",
250
+ "from",
251
+ "student_course_registrations",
252
+ "group",
253
+ "by",
254
+ "student_id",
255
+ "order",
256
+ "by",
257
+ "count",
258
+ "(",
259
+ "*",
260
+ ")",
261
+ "limit",
262
+ "value"
263
+ ],
264
+ "question_toks": [
265
+ "What",
266
+ "are",
267
+ "the",
268
+ "ids",
269
+ "of",
270
+ "the",
271
+ "students",
272
+ "who",
273
+ "registered",
274
+ "for",
275
+ "some",
276
+ "courses",
277
+ "but",
278
+ "had",
279
+ "the",
280
+ "least",
281
+ "number",
282
+ "of",
283
+ "courses",
284
+ "for",
285
+ "all",
286
+ "students",
287
+ "?"
288
+ ]
289
+ },
290
+ {
291
+ "db_id": "student_assessment",
292
+ "query": "SELECT T2.first_name , T2.last_name FROM candidates AS T1 JOIN people AS T2 ON T1.candidate_id = T2.person_id",
293
+ "question": "what are the first name and last name of all candidates?",
294
+ "query_toks": [
295
+ "SELECT",
296
+ "T2.first_name",
297
+ ",",
298
+ "T2.last_name",
299
+ "FROM",
300
+ "candidates",
301
+ "AS",
302
+ "T1",
303
+ "JOIN",
304
+ "people",
305
+ "AS",
306
+ "T2",
307
+ "ON",
308
+ "T1.candidate_id",
309
+ "=",
310
+ "T2.person_id"
311
+ ],
312
+ "query_toks_no_value": [
313
+ "select",
314
+ "t2",
315
+ ".",
316
+ "first_name",
317
+ ",",
318
+ "t2",
319
+ ".",
320
+ "last_name",
321
+ "from",
322
+ "candidates",
323
+ "as",
324
+ "t1",
325
+ "join",
326
+ "people",
327
+ "as",
328
+ "t2",
329
+ "on",
330
+ "t1",
331
+ ".",
332
+ "candidate_id",
333
+ "=",
334
+ "t2",
335
+ ".",
336
+ "person_id"
337
+ ],
338
+ "question_toks": [
339
+ "what",
340
+ "are",
341
+ "the",
342
+ "first",
343
+ "name",
344
+ "and",
345
+ "last",
346
+ "name",
347
+ "of",
348
+ "all",
349
+ "candidates",
350
+ "?"
351
+ ]
352
+ },
353
+ {
354
+ "db_id": "student_assessment",
355
+ "query": "SELECT T2.first_name , T2.last_name FROM candidates AS T1 JOIN people AS T2 ON T1.candidate_id = T2.person_id",
356
+ "question": "What are the first and last names of all the candidates?",
357
+ "query_toks": [
358
+ "SELECT",
359
+ "T2.first_name",
360
+ ",",
361
+ "T2.last_name",
362
+ "FROM",
363
+ "candidates",
364
+ "AS",
365
+ "T1",
366
+ "JOIN",
367
+ "people",
368
+ "AS",
369
+ "T2",
370
+ "ON",
371
+ "T1.candidate_id",
372
+ "=",
373
+ "T2.person_id"
374
+ ],
375
+ "query_toks_no_value": [
376
+ "select",
377
+ "t2",
378
+ ".",
379
+ "first_name",
380
+ ",",
381
+ "t2",
382
+ ".",
383
+ "last_name",
384
+ "from",
385
+ "candidates",
386
+ "as",
387
+ "t1",
388
+ "join",
389
+ "people",
390
+ "as",
391
+ "t2",
392
+ "on",
393
+ "t1",
394
+ ".",
395
+ "candidate_id",
396
+ "=",
397
+ "t2",
398
+ ".",
399
+ "person_id"
400
+ ],
401
+ "question_toks": [
402
+ "What",
403
+ "are",
404
+ "the",
405
+ "first",
406
+ "and",
407
+ "last",
408
+ "names",
409
+ "of",
410
+ "all",
411
+ "the",
412
+ "candidates",
413
+ "?"
414
+ ]
415
+ },
416
+ {
417
+ "db_id": "student_assessment",
418
+ "query": "SELECT student_id FROM students WHERE student_id NOT IN (SELECT student_id FROM student_course_attendance)",
419
+ "question": "List the id of students who never attends courses?",
420
+ "query_toks": [
421
+ "SELECT",
422
+ "student_id",
423
+ "FROM",
424
+ "students",
425
+ "WHERE",
426
+ "student_id",
427
+ "NOT",
428
+ "IN",
429
+ "(",
430
+ "SELECT",
431
+ "student_id",
432
+ "FROM",
433
+ "student_course_attendance",
434
+ ")"
435
+ ],
436
+ "query_toks_no_value": [
437
+ "select",
438
+ "student_id",
439
+ "from",
440
+ "students",
441
+ "where",
442
+ "student_id",
443
+ "not",
444
+ "in",
445
+ "(",
446
+ "select",
447
+ "student_id",
448
+ "from",
449
+ "student_course_attendance",
450
+ ")"
451
+ ],
452
+ "question_toks": [
453
+ "List",
454
+ "the",
455
+ "id",
456
+ "of",
457
+ "students",
458
+ "who",
459
+ "never",
460
+ "attends",
461
+ "courses",
462
+ "?"
463
+ ]
464
+ },
465
+ {
466
+ "db_id": "student_assessment",
467
+ "query": "SELECT student_id FROM students WHERE student_id NOT IN (SELECT student_id FROM student_course_attendance)",
468
+ "question": "What are the ids of every student who has never attended a course?",
469
+ "query_toks": [
470
+ "SELECT",
471
+ "student_id",
472
+ "FROM",
473
+ "students",
474
+ "WHERE",
475
+ "student_id",
476
+ "NOT",
477
+ "IN",
478
+ "(",
479
+ "SELECT",
480
+ "student_id",
481
+ "FROM",
482
+ "student_course_attendance",
483
+ ")"
484
+ ],
485
+ "query_toks_no_value": [
486
+ "select",
487
+ "student_id",
488
+ "from",
489
+ "students",
490
+ "where",
491
+ "student_id",
492
+ "not",
493
+ "in",
494
+ "(",
495
+ "select",
496
+ "student_id",
497
+ "from",
498
+ "student_course_attendance",
499
+ ")"
500
+ ],
501
+ "question_toks": [
502
+ "What",
503
+ "are",
504
+ "the",
505
+ "ids",
506
+ "of",
507
+ "every",
508
+ "student",
509
+ "who",
510
+ "has",
511
+ "never",
512
+ "attended",
513
+ "a",
514
+ "course",
515
+ "?"
516
+ ]
517
+ },
518
+ {
519
+ "db_id": "student_assessment",
520
+ "query": "SELECT student_id FROM student_course_attendance",
521
+ "question": "List the id of students who attended some courses?",
522
+ "query_toks": [
523
+ "SELECT",
524
+ "student_id",
525
+ "FROM",
526
+ "student_course_attendance"
527
+ ],
528
+ "query_toks_no_value": [
529
+ "select",
530
+ "student_id",
531
+ "from",
532
+ "student_course_attendance"
533
+ ],
534
+ "question_toks": [
535
+ "List",
536
+ "the",
537
+ "id",
538
+ "of",
539
+ "students",
540
+ "who",
541
+ "attended",
542
+ "some",
543
+ "courses",
544
+ "?"
545
+ ]
546
+ },
547
+ {
548
+ "db_id": "student_assessment",
549
+ "query": "SELECT student_id FROM student_course_attendance",
550
+ "question": "What are the ids of all students who have attended at least one course?",
551
+ "query_toks": [
552
+ "SELECT",
553
+ "student_id",
554
+ "FROM",
555
+ "student_course_attendance"
556
+ ],
557
+ "query_toks_no_value": [
558
+ "select",
559
+ "student_id",
560
+ "from",
561
+ "student_course_attendance"
562
+ ],
563
+ "question_toks": [
564
+ "What",
565
+ "are",
566
+ "the",
567
+ "ids",
568
+ "of",
569
+ "all",
570
+ "students",
571
+ "who",
572
+ "have",
573
+ "attended",
574
+ "at",
575
+ "least",
576
+ "one",
577
+ "course",
578
+ "?"
579
+ ]
580
+ },
581
+ {
582
+ "db_id": "student_assessment",
583
+ "query": "SELECT T1.student_id , T2.course_name FROM student_course_registrations AS T1 JOIN courses AS T2 ON T1.course_id = T2.course_id",
584
+ "question": "What are the ids of all students for courses and what are the names of those courses?",
585
+ "query_toks": [
586
+ "SELECT",
587
+ "T1.student_id",
588
+ ",",
589
+ "T2.course_name",
590
+ "FROM",
591
+ "student_course_registrations",
592
+ "AS",
593
+ "T1",
594
+ "JOIN",
595
+ "courses",
596
+ "AS",
597
+ "T2",
598
+ "ON",
599
+ "T1.course_id",
600
+ "=",
601
+ "T2.course_id"
602
+ ],
603
+ "query_toks_no_value": [
604
+ "select",
605
+ "t1",
606
+ ".",
607
+ "student_id",
608
+ ",",
609
+ "t2",
610
+ ".",
611
+ "course_name",
612
+ "from",
613
+ "student_course_registrations",
614
+ "as",
615
+ "t1",
616
+ "join",
617
+ "courses",
618
+ "as",
619
+ "t2",
620
+ "on",
621
+ "t1",
622
+ ".",
623
+ "course_id",
624
+ "=",
625
+ "t2",
626
+ ".",
627
+ "course_id"
628
+ ],
629
+ "question_toks": [
630
+ "What",
631
+ "are",
632
+ "the",
633
+ "ids",
634
+ "of",
635
+ "all",
636
+ "students",
637
+ "for",
638
+ "courses",
639
+ "and",
640
+ "what",
641
+ "are",
642
+ "the",
643
+ "names",
644
+ "of",
645
+ "those",
646
+ "courses",
647
+ "?"
648
+ ]
649
+ },
650
+ {
651
+ "db_id": "student_assessment",
652
+ "query": "SELECT T2.student_details FROM student_course_registrations AS T1 JOIN students AS T2 ON T1.student_id = T2.student_id ORDER BY T1.registration_date DESC LIMIT 1",
653
+ "question": "What is detail of the student who most recently registered course?",
654
+ "query_toks": [
655
+ "SELECT",
656
+ "T2.student_details",
657
+ "FROM",
658
+ "student_course_registrations",
659
+ "AS",
660
+ "T1",
661
+ "JOIN",
662
+ "students",
663
+ "AS",
664
+ "T2",
665
+ "ON",
666
+ "T1.student_id",
667
+ "=",
668
+ "T2.student_id",
669
+ "ORDER",
670
+ "BY",
671
+ "T1.registration_date",
672
+ "DESC",
673
+ "LIMIT",
674
+ "1"
675
+ ],
676
+ "query_toks_no_value": [
677
+ "select",
678
+ "t2",
679
+ ".",
680
+ "student_details",
681
+ "from",
682
+ "student_course_registrations",
683
+ "as",
684
+ "t1",
685
+ "join",
686
+ "students",
687
+ "as",
688
+ "t2",
689
+ "on",
690
+ "t1",
691
+ ".",
692
+ "student_id",
693
+ "=",
694
+ "t2",
695
+ ".",
696
+ "student_id",
697
+ "order",
698
+ "by",
699
+ "t1",
700
+ ".",
701
+ "registration_date",
702
+ "desc",
703
+ "limit",
704
+ "value"
705
+ ],
706
+ "question_toks": [
707
+ "What",
708
+ "is",
709
+ "detail",
710
+ "of",
711
+ "the",
712
+ "student",
713
+ "who",
714
+ "most",
715
+ "recently",
716
+ "registered",
717
+ "course",
718
+ "?"
719
+ ]
720
+ },
721
+ {
722
+ "db_id": "student_assessment",
723
+ "query": "SELECT T2.student_details FROM student_course_registrations AS T1 JOIN students AS T2 ON T1.student_id = T2.student_id ORDER BY T1.registration_date DESC LIMIT 1",
724
+ "question": "What details do we have on the students who registered for courses most recently?",
725
+ "query_toks": [
726
+ "SELECT",
727
+ "T2.student_details",
728
+ "FROM",
729
+ "student_course_registrations",
730
+ "AS",
731
+ "T1",
732
+ "JOIN",
733
+ "students",
734
+ "AS",
735
+ "T2",
736
+ "ON",
737
+ "T1.student_id",
738
+ "=",
739
+ "T2.student_id",
740
+ "ORDER",
741
+ "BY",
742
+ "T1.registration_date",
743
+ "DESC",
744
+ "LIMIT",
745
+ "1"
746
+ ],
747
+ "query_toks_no_value": [
748
+ "select",
749
+ "t2",
750
+ ".",
751
+ "student_details",
752
+ "from",
753
+ "student_course_registrations",
754
+ "as",
755
+ "t1",
756
+ "join",
757
+ "students",
758
+ "as",
759
+ "t2",
760
+ "on",
761
+ "t1",
762
+ ".",
763
+ "student_id",
764
+ "=",
765
+ "t2",
766
+ ".",
767
+ "student_id",
768
+ "order",
769
+ "by",
770
+ "t1",
771
+ ".",
772
+ "registration_date",
773
+ "desc",
774
+ "limit",
775
+ "value"
776
+ ],
777
+ "question_toks": [
778
+ "What",
779
+ "details",
780
+ "do",
781
+ "we",
782
+ "have",
783
+ "on",
784
+ "the",
785
+ "students",
786
+ "who",
787
+ "registered",
788
+ "for",
789
+ "courses",
790
+ "most",
791
+ "recently",
792
+ "?"
793
+ ]
794
+ },
795
+ {
796
+ "db_id": "student_assessment",
797
+ "query": "SELECT count(*) FROM courses AS T1 JOIN student_course_attendance AS T2 ON T1.course_id = T2.course_id WHERE T1.course_name = \"English\"",
798
+ "question": "How many students attend course English?",
799
+ "query_toks": [
800
+ "SELECT",
801
+ "count",
802
+ "(",
803
+ "*",
804
+ ")",
805
+ "FROM",
806
+ "courses",
807
+ "AS",
808
+ "T1",
809
+ "JOIN",
810
+ "student_course_attendance",
811
+ "AS",
812
+ "T2",
813
+ "ON",
814
+ "T1.course_id",
815
+ "=",
816
+ "T2.course_id",
817
+ "WHERE",
818
+ "T1.course_name",
819
+ "=",
820
+ "``",
821
+ "English",
822
+ "''"
823
+ ],
824
+ "query_toks_no_value": [
825
+ "select",
826
+ "count",
827
+ "(",
828
+ "*",
829
+ ")",
830
+ "from",
831
+ "courses",
832
+ "as",
833
+ "t1",
834
+ "join",
835
+ "student_course_attendance",
836
+ "as",
837
+ "t2",
838
+ "on",
839
+ "t1",
840
+ ".",
841
+ "course_id",
842
+ "=",
843
+ "t2",
844
+ ".",
845
+ "course_id",
846
+ "where",
847
+ "t1",
848
+ ".",
849
+ "course_name",
850
+ "=",
851
+ "value"
852
+ ],
853
+ "question_toks": [
854
+ "How",
855
+ "many",
856
+ "students",
857
+ "attend",
858
+ "course",
859
+ "English",
860
+ "?"
861
+ ]
862
+ },
863
+ {
864
+ "db_id": "student_assessment",
865
+ "query": "SELECT count(*) FROM courses AS T1 JOIN student_course_attendance AS T2 ON T1.course_id = T2.course_id WHERE T1.course_name = \"English\"",
866
+ "question": "How many students are attending English courses?",
867
+ "query_toks": [
868
+ "SELECT",
869
+ "count",
870
+ "(",
871
+ "*",
872
+ ")",
873
+ "FROM",
874
+ "courses",
875
+ "AS",
876
+ "T1",
877
+ "JOIN",
878
+ "student_course_attendance",
879
+ "AS",
880
+ "T2",
881
+ "ON",
882
+ "T1.course_id",
883
+ "=",
884
+ "T2.course_id",
885
+ "WHERE",
886
+ "T1.course_name",
887
+ "=",
888
+ "``",
889
+ "English",
890
+ "''"
891
+ ],
892
+ "query_toks_no_value": [
893
+ "select",
894
+ "count",
895
+ "(",
896
+ "*",
897
+ ")",
898
+ "from",
899
+ "courses",
900
+ "as",
901
+ "t1",
902
+ "join",
903
+ "student_course_attendance",
904
+ "as",
905
+ "t2",
906
+ "on",
907
+ "t1",
908
+ ".",
909
+ "course_id",
910
+ "=",
911
+ "t2",
912
+ ".",
913
+ "course_id",
914
+ "where",
915
+ "t1",
916
+ ".",
917
+ "course_name",
918
+ "=",
919
+ "value"
920
+ ],
921
+ "question_toks": [
922
+ "How",
923
+ "many",
924
+ "students",
925
+ "are",
926
+ "attending",
927
+ "English",
928
+ "courses",
929
+ "?"
930
+ ]
931
+ },
932
+ {
933
+ "db_id": "student_assessment",
934
+ "query": "SELECT count(*) FROM courses AS T1 JOIN student_course_attendance AS T2 ON T1.course_id = T2.course_id WHERE T2.student_id = 171",
935
+ "question": "How many courses do the student whose id is 171 attend?",
936
+ "query_toks": [
937
+ "SELECT",
938
+ "count",
939
+ "(",
940
+ "*",
941
+ ")",
942
+ "FROM",
943
+ "courses",
944
+ "AS",
945
+ "T1",
946
+ "JOIN",
947
+ "student_course_attendance",
948
+ "AS",
949
+ "T2",
950
+ "ON",
951
+ "T1.course_id",
952
+ "=",
953
+ "T2.course_id",
954
+ "WHERE",
955
+ "T2.student_id",
956
+ "=",
957
+ "171"
958
+ ],
959
+ "query_toks_no_value": [
960
+ "select",
961
+ "count",
962
+ "(",
963
+ "*",
964
+ ")",
965
+ "from",
966
+ "courses",
967
+ "as",
968
+ "t1",
969
+ "join",
970
+ "student_course_attendance",
971
+ "as",
972
+ "t2",
973
+ "on",
974
+ "t1",
975
+ ".",
976
+ "course_id",
977
+ "=",
978
+ "t2",
979
+ ".",
980
+ "course_id",
981
+ "where",
982
+ "t2",
983
+ ".",
984
+ "student_id",
985
+ "=",
986
+ "value"
987
+ ],
988
+ "question_toks": [
989
+ "How",
990
+ "many",
991
+ "courses",
992
+ "do",
993
+ "the",
994
+ "student",
995
+ "whose",
996
+ "id",
997
+ "is",
998
+ "171",
999
+ "attend",
1000
+ "?"
1001
+ ]
1002
+ },
1003
+ {
1004
+ "db_id": "student_assessment",
1005
+ "query": "SELECT count(*) FROM courses AS T1 JOIN student_course_attendance AS T2 ON T1.course_id = T2.course_id WHERE T2.student_id = 171",
1006
+ "question": "How many courses does the student with id 171 actually attend?",
1007
+ "query_toks": [
1008
+ "SELECT",
1009
+ "count",
1010
+ "(",
1011
+ "*",
1012
+ ")",
1013
+ "FROM",
1014
+ "courses",
1015
+ "AS",
1016
+ "T1",
1017
+ "JOIN",
1018
+ "student_course_attendance",
1019
+ "AS",
1020
+ "T2",
1021
+ "ON",
1022
+ "T1.course_id",
1023
+ "=",
1024
+ "T2.course_id",
1025
+ "WHERE",
1026
+ "T2.student_id",
1027
+ "=",
1028
+ "171"
1029
+ ],
1030
+ "query_toks_no_value": [
1031
+ "select",
1032
+ "count",
1033
+ "(",
1034
+ "*",
1035
+ ")",
1036
+ "from",
1037
+ "courses",
1038
+ "as",
1039
+ "t1",
1040
+ "join",
1041
+ "student_course_attendance",
1042
+ "as",
1043
+ "t2",
1044
+ "on",
1045
+ "t1",
1046
+ ".",
1047
+ "course_id",
1048
+ "=",
1049
+ "t2",
1050
+ ".",
1051
+ "course_id",
1052
+ "where",
1053
+ "t2",
1054
+ ".",
1055
+ "student_id",
1056
+ "=",
1057
+ "value"
1058
+ ],
1059
+ "question_toks": [
1060
+ "How",
1061
+ "many",
1062
+ "courses",
1063
+ "does",
1064
+ "the",
1065
+ "student",
1066
+ "with",
1067
+ "id",
1068
+ "171",
1069
+ "actually",
1070
+ "attend",
1071
+ "?"
1072
+ ]
1073
+ },
1074
+ {
1075
+ "db_id": "student_assessment",
1076
+ "query": "SELECT T2.candidate_id FROM people AS T1 JOIN candidates AS T2 ON T1.person_id = T2.candidate_id WHERE T1.email_address = \"stanley.monahan@example.org\"",
1077
+ "question": "Find id of the candidate whose email is stanley.monahan@example.org?",
1078
+ "query_toks": [
1079
+ "SELECT",
1080
+ "T2.candidate_id",
1081
+ "FROM",
1082
+ "people",
1083
+ "AS",
1084
+ "T1",
1085
+ "JOIN",
1086
+ "candidates",
1087
+ "AS",
1088
+ "T2",
1089
+ "ON",
1090
+ "T1.person_id",
1091
+ "=",
1092
+ "T2.candidate_id",
1093
+ "WHERE",
1094
+ "T1.email_address",
1095
+ "=",
1096
+ "``",
1097
+ "stanley.monahan",
1098
+ "@",
1099
+ "example.org",
1100
+ "''"
1101
+ ],
1102
+ "query_toks_no_value": [
1103
+ "select",
1104
+ "t2",
1105
+ ".",
1106
+ "candidate_id",
1107
+ "from",
1108
+ "people",
1109
+ "as",
1110
+ "t1",
1111
+ "join",
1112
+ "candidates",
1113
+ "as",
1114
+ "t2",
1115
+ "on",
1116
+ "t1",
1117
+ ".",
1118
+ "person_id",
1119
+ "=",
1120
+ "t2",
1121
+ ".",
1122
+ "candidate_id",
1123
+ "where",
1124
+ "t1",
1125
+ ".",
1126
+ "email_address",
1127
+ "=",
1128
+ "value"
1129
+ ],
1130
+ "question_toks": [
1131
+ "Find",
1132
+ "id",
1133
+ "of",
1134
+ "the",
1135
+ "candidate",
1136
+ "whose",
1137
+ "email",
1138
+ "is",
1139
+ "stanley.monahan",
1140
+ "@",
1141
+ "example.org",
1142
+ "?"
1143
+ ]
1144
+ },
1145
+ {
1146
+ "db_id": "student_assessment",
1147
+ "query": "SELECT T2.candidate_id FROM people AS T1 JOIN candidates AS T2 ON T1.person_id = T2.candidate_id WHERE T1.email_address = \"stanley.monahan@example.org\"",
1148
+ "question": "What is the id of the candidate whose email is stanley.monahan@example.org?",
1149
+ "query_toks": [
1150
+ "SELECT",
1151
+ "T2.candidate_id",
1152
+ "FROM",
1153
+ "people",
1154
+ "AS",
1155
+ "T1",
1156
+ "JOIN",
1157
+ "candidates",
1158
+ "AS",
1159
+ "T2",
1160
+ "ON",
1161
+ "T1.person_id",
1162
+ "=",
1163
+ "T2.candidate_id",
1164
+ "WHERE",
1165
+ "T1.email_address",
1166
+ "=",
1167
+ "``",
1168
+ "stanley.monahan",
1169
+ "@",
1170
+ "example.org",
1171
+ "''"
1172
+ ],
1173
+ "query_toks_no_value": [
1174
+ "select",
1175
+ "t2",
1176
+ ".",
1177
+ "candidate_id",
1178
+ "from",
1179
+ "people",
1180
+ "as",
1181
+ "t1",
1182
+ "join",
1183
+ "candidates",
1184
+ "as",
1185
+ "t2",
1186
+ "on",
1187
+ "t1",
1188
+ ".",
1189
+ "person_id",
1190
+ "=",
1191
+ "t2",
1192
+ ".",
1193
+ "candidate_id",
1194
+ "where",
1195
+ "t1",
1196
+ ".",
1197
+ "email_address",
1198
+ "=",
1199
+ "value"
1200
+ ],
1201
+ "question_toks": [
1202
+ "What",
1203
+ "is",
1204
+ "the",
1205
+ "id",
1206
+ "of",
1207
+ "the",
1208
+ "candidate",
1209
+ "whose",
1210
+ "email",
1211
+ "is",
1212
+ "stanley.monahan",
1213
+ "@",
1214
+ "example.org",
1215
+ "?"
1216
+ ]
1217
+ },
1218
+ {
1219
+ "db_id": "student_assessment",
1220
+ "query": "SELECT candidate_id FROM candidate_assessments ORDER BY assessment_date DESC LIMIT 1",
1221
+ "question": "Find id of the candidate who most recently accessed the course?",
1222
+ "query_toks": [
1223
+ "SELECT",
1224
+ "candidate_id",
1225
+ "FROM",
1226
+ "candidate_assessments",
1227
+ "ORDER",
1228
+ "BY",
1229
+ "assessment_date",
1230
+ "DESC",
1231
+ "LIMIT",
1232
+ "1"
1233
+ ],
1234
+ "query_toks_no_value": [
1235
+ "select",
1236
+ "candidate_id",
1237
+ "from",
1238
+ "candidate_assessments",
1239
+ "order",
1240
+ "by",
1241
+ "assessment_date",
1242
+ "desc",
1243
+ "limit",
1244
+ "value"
1245
+ ],
1246
+ "question_toks": [
1247
+ "Find",
1248
+ "id",
1249
+ "of",
1250
+ "the",
1251
+ "candidate",
1252
+ "who",
1253
+ "most",
1254
+ "recently",
1255
+ "accessed",
1256
+ "the",
1257
+ "course",
1258
+ "?"
1259
+ ]
1260
+ },
1261
+ {
1262
+ "db_id": "student_assessment",
1263
+ "query": "SELECT candidate_id FROM candidate_assessments ORDER BY assessment_date DESC LIMIT 1",
1264
+ "question": "What is the id of the candidate who most recently accessed the course?",
1265
+ "query_toks": [
1266
+ "SELECT",
1267
+ "candidate_id",
1268
+ "FROM",
1269
+ "candidate_assessments",
1270
+ "ORDER",
1271
+ "BY",
1272
+ "assessment_date",
1273
+ "DESC",
1274
+ "LIMIT",
1275
+ "1"
1276
+ ],
1277
+ "query_toks_no_value": [
1278
+ "select",
1279
+ "candidate_id",
1280
+ "from",
1281
+ "candidate_assessments",
1282
+ "order",
1283
+ "by",
1284
+ "assessment_date",
1285
+ "desc",
1286
+ "limit",
1287
+ "value"
1288
+ ],
1289
+ "question_toks": [
1290
+ "What",
1291
+ "is",
1292
+ "the",
1293
+ "id",
1294
+ "of",
1295
+ "the",
1296
+ "candidate",
1297
+ "who",
1298
+ "most",
1299
+ "recently",
1300
+ "accessed",
1301
+ "the",
1302
+ "course",
1303
+ "?"
1304
+ ]
1305
+ },
1306
+ {
1307
+ "db_id": "student_assessment",
1308
+ "query": "SELECT T1.student_details FROM students AS T1 JOIN student_course_registrations AS T2 ON T1.student_id = T2.student_id GROUP BY T1.student_id ORDER BY count(*) DESC LIMIT 1",
1309
+ "question": "What is detail of the student who registered the most number of courses?",
1310
+ "query_toks": [
1311
+ "SELECT",
1312
+ "T1.student_details",
1313
+ "FROM",
1314
+ "students",
1315
+ "AS",
1316
+ "T1",
1317
+ "JOIN",
1318
+ "student_course_registrations",
1319
+ "AS",
1320
+ "T2",
1321
+ "ON",
1322
+ "T1.student_id",
1323
+ "=",
1324
+ "T2.student_id",
1325
+ "GROUP",
1326
+ "BY",
1327
+ "T1.student_id",
1328
+ "ORDER",
1329
+ "BY",
1330
+ "count",
1331
+ "(",
1332
+ "*",
1333
+ ")",
1334
+ "DESC",
1335
+ "LIMIT",
1336
+ "1"
1337
+ ],
1338
+ "query_toks_no_value": [
1339
+ "select",
1340
+ "t1",
1341
+ ".",
1342
+ "student_details",
1343
+ "from",
1344
+ "students",
1345
+ "as",
1346
+ "t1",
1347
+ "join",
1348
+ "student_course_registrations",
1349
+ "as",
1350
+ "t2",
1351
+ "on",
1352
+ "t1",
1353
+ ".",
1354
+ "student_id",
1355
+ "=",
1356
+ "t2",
1357
+ ".",
1358
+ "student_id",
1359
+ "group",
1360
+ "by",
1361
+ "t1",
1362
+ ".",
1363
+ "student_id",
1364
+ "order",
1365
+ "by",
1366
+ "count",
1367
+ "(",
1368
+ "*",
1369
+ ")",
1370
+ "desc",
1371
+ "limit",
1372
+ "value"
1373
+ ],
1374
+ "question_toks": [
1375
+ "What",
1376
+ "is",
1377
+ "detail",
1378
+ "of",
1379
+ "the",
1380
+ "student",
1381
+ "who",
1382
+ "registered",
1383
+ "the",
1384
+ "most",
1385
+ "number",
1386
+ "of",
1387
+ "courses",
1388
+ "?"
1389
+ ]
1390
+ },
1391
+ {
1392
+ "db_id": "student_assessment",
1393
+ "query": "SELECT T1.student_details FROM students AS T1 JOIN student_course_registrations AS T2 ON T1.student_id = T2.student_id GROUP BY T1.student_id ORDER BY count(*) DESC LIMIT 1",
1394
+ "question": "What are the details of the student who registered for the most number of courses?",
1395
+ "query_toks": [
1396
+ "SELECT",
1397
+ "T1.student_details",
1398
+ "FROM",
1399
+ "students",
1400
+ "AS",
1401
+ "T1",
1402
+ "JOIN",
1403
+ "student_course_registrations",
1404
+ "AS",
1405
+ "T2",
1406
+ "ON",
1407
+ "T1.student_id",
1408
+ "=",
1409
+ "T2.student_id",
1410
+ "GROUP",
1411
+ "BY",
1412
+ "T1.student_id",
1413
+ "ORDER",
1414
+ "BY",
1415
+ "count",
1416
+ "(",
1417
+ "*",
1418
+ ")",
1419
+ "DESC",
1420
+ "LIMIT",
1421
+ "1"
1422
+ ],
1423
+ "query_toks_no_value": [
1424
+ "select",
1425
+ "t1",
1426
+ ".",
1427
+ "student_details",
1428
+ "from",
1429
+ "students",
1430
+ "as",
1431
+ "t1",
1432
+ "join",
1433
+ "student_course_registrations",
1434
+ "as",
1435
+ "t2",
1436
+ "on",
1437
+ "t1",
1438
+ ".",
1439
+ "student_id",
1440
+ "=",
1441
+ "t2",
1442
+ ".",
1443
+ "student_id",
1444
+ "group",
1445
+ "by",
1446
+ "t1",
1447
+ ".",
1448
+ "student_id",
1449
+ "order",
1450
+ "by",
1451
+ "count",
1452
+ "(",
1453
+ "*",
1454
+ ")",
1455
+ "desc",
1456
+ "limit",
1457
+ "value"
1458
+ ],
1459
+ "question_toks": [
1460
+ "What",
1461
+ "are",
1462
+ "the",
1463
+ "details",
1464
+ "of",
1465
+ "the",
1466
+ "student",
1467
+ "who",
1468
+ "registered",
1469
+ "for",
1470
+ "the",
1471
+ "most",
1472
+ "number",
1473
+ "of",
1474
+ "courses",
1475
+ "?"
1476
+ ]
1477
+ },
1478
+ {
1479
+ "db_id": "student_assessment",
1480
+ "query": "SELECT T1.student_id , count(*) FROM students AS T1 JOIN student_course_registrations AS T2 ON T1.student_id = T2.student_id GROUP BY T1.student_id",
1481
+ "question": "List the id of students who registered some courses and the number of their registered courses?",
1482
+ "query_toks": [
1483
+ "SELECT",
1484
+ "T1.student_id",
1485
+ ",",
1486
+ "count",
1487
+ "(",
1488
+ "*",
1489
+ ")",
1490
+ "FROM",
1491
+ "students",
1492
+ "AS",
1493
+ "T1",
1494
+ "JOIN",
1495
+ "student_course_registrations",
1496
+ "AS",
1497
+ "T2",
1498
+ "ON",
1499
+ "T1.student_id",
1500
+ "=",
1501
+ "T2.student_id",
1502
+ "GROUP",
1503
+ "BY",
1504
+ "T1.student_id"
1505
+ ],
1506
+ "query_toks_no_value": [
1507
+ "select",
1508
+ "t1",
1509
+ ".",
1510
+ "student_id",
1511
+ ",",
1512
+ "count",
1513
+ "(",
1514
+ "*",
1515
+ ")",
1516
+ "from",
1517
+ "students",
1518
+ "as",
1519
+ "t1",
1520
+ "join",
1521
+ "student_course_registrations",
1522
+ "as",
1523
+ "t2",
1524
+ "on",
1525
+ "t1",
1526
+ ".",
1527
+ "student_id",
1528
+ "=",
1529
+ "t2",
1530
+ ".",
1531
+ "student_id",
1532
+ "group",
1533
+ "by",
1534
+ "t1",
1535
+ ".",
1536
+ "student_id"
1537
+ ],
1538
+ "question_toks": [
1539
+ "List",
1540
+ "the",
1541
+ "id",
1542
+ "of",
1543
+ "students",
1544
+ "who",
1545
+ "registered",
1546
+ "some",
1547
+ "courses",
1548
+ "and",
1549
+ "the",
1550
+ "number",
1551
+ "of",
1552
+ "their",
1553
+ "registered",
1554
+ "courses",
1555
+ "?"
1556
+ ]
1557
+ },
1558
+ {
1559
+ "db_id": "student_assessment",
1560
+ "query": "SELECT T1.student_id , count(*) FROM students AS T1 JOIN student_course_registrations AS T2 ON T1.student_id = T2.student_id GROUP BY T1.student_id",
1561
+ "question": "For every student who is registered for some course, how many courses are they registered for?",
1562
+ "query_toks": [
1563
+ "SELECT",
1564
+ "T1.student_id",
1565
+ ",",
1566
+ "count",
1567
+ "(",
1568
+ "*",
1569
+ ")",
1570
+ "FROM",
1571
+ "students",
1572
+ "AS",
1573
+ "T1",
1574
+ "JOIN",
1575
+ "student_course_registrations",
1576
+ "AS",
1577
+ "T2",
1578
+ "ON",
1579
+ "T1.student_id",
1580
+ "=",
1581
+ "T2.student_id",
1582
+ "GROUP",
1583
+ "BY",
1584
+ "T1.student_id"
1585
+ ],
1586
+ "query_toks_no_value": [
1587
+ "select",
1588
+ "t1",
1589
+ ".",
1590
+ "student_id",
1591
+ ",",
1592
+ "count",
1593
+ "(",
1594
+ "*",
1595
+ ")",
1596
+ "from",
1597
+ "students",
1598
+ "as",
1599
+ "t1",
1600
+ "join",
1601
+ "student_course_registrations",
1602
+ "as",
1603
+ "t2",
1604
+ "on",
1605
+ "t1",
1606
+ ".",
1607
+ "student_id",
1608
+ "=",
1609
+ "t2",
1610
+ ".",
1611
+ "student_id",
1612
+ "group",
1613
+ "by",
1614
+ "t1",
1615
+ ".",
1616
+ "student_id"
1617
+ ],
1618
+ "question_toks": [
1619
+ "For",
1620
+ "every",
1621
+ "student",
1622
+ "who",
1623
+ "is",
1624
+ "registered",
1625
+ "for",
1626
+ "some",
1627
+ "course",
1628
+ ",",
1629
+ "how",
1630
+ "many",
1631
+ "courses",
1632
+ "are",
1633
+ "they",
1634
+ "registered",
1635
+ "for",
1636
+ "?"
1637
+ ]
1638
+ },
1639
+ {
1640
+ "db_id": "student_assessment",
1641
+ "query": "SELECT T3.course_name , count(*) FROM students AS T1 JOIN student_course_registrations AS T2 ON T1.student_id = T2.student_id JOIN courses AS T3 ON T2.course_id = T3.course_id GROUP BY T2.course_id",
1642
+ "question": "How many registed students do each course have? List course name and the number of their registered students?",
1643
+ "query_toks": [
1644
+ "SELECT",
1645
+ "T3.course_name",
1646
+ ",",
1647
+ "count",
1648
+ "(",
1649
+ "*",
1650
+ ")",
1651
+ "FROM",
1652
+ "students",
1653
+ "AS",
1654
+ "T1",
1655
+ "JOIN",
1656
+ "student_course_registrations",
1657
+ "AS",
1658
+ "T2",
1659
+ "ON",
1660
+ "T1.student_id",
1661
+ "=",
1662
+ "T2.student_id",
1663
+ "JOIN",
1664
+ "courses",
1665
+ "AS",
1666
+ "T3",
1667
+ "ON",
1668
+ "T2.course_id",
1669
+ "=",
1670
+ "T3.course_id",
1671
+ "GROUP",
1672
+ "BY",
1673
+ "T2.course_id"
1674
+ ],
1675
+ "query_toks_no_value": [
1676
+ "select",
1677
+ "t3",
1678
+ ".",
1679
+ "course_name",
1680
+ ",",
1681
+ "count",
1682
+ "(",
1683
+ "*",
1684
+ ")",
1685
+ "from",
1686
+ "students",
1687
+ "as",
1688
+ "t1",
1689
+ "join",
1690
+ "student_course_registrations",
1691
+ "as",
1692
+ "t2",
1693
+ "on",
1694
+ "t1",
1695
+ ".",
1696
+ "student_id",
1697
+ "=",
1698
+ "t2",
1699
+ ".",
1700
+ "student_id",
1701
+ "join",
1702
+ "courses",
1703
+ "as",
1704
+ "t3",
1705
+ "on",
1706
+ "t2",
1707
+ ".",
1708
+ "course_id",
1709
+ "=",
1710
+ "t3",
1711
+ ".",
1712
+ "course_id",
1713
+ "group",
1714
+ "by",
1715
+ "t2",
1716
+ ".",
1717
+ "course_id"
1718
+ ],
1719
+ "question_toks": [
1720
+ "How",
1721
+ "many",
1722
+ "registed",
1723
+ "students",
1724
+ "do",
1725
+ "each",
1726
+ "course",
1727
+ "have",
1728
+ "?",
1729
+ "List",
1730
+ "course",
1731
+ "name",
1732
+ "and",
1733
+ "the",
1734
+ "number",
1735
+ "of",
1736
+ "their",
1737
+ "registered",
1738
+ "students",
1739
+ "?"
1740
+ ]
1741
+ },
1742
+ {
1743
+ "db_id": "student_assessment",
1744
+ "query": "SELECT T3.course_name , count(*) FROM students AS T1 JOIN student_course_registrations AS T2 ON T1.student_id = T2.student_id JOIN courses AS T3 ON T2.course_id = T3.course_id GROUP BY T2.course_id",
1745
+ "question": "For each course id, how many students are registered and what are the course names?",
1746
+ "query_toks": [
1747
+ "SELECT",
1748
+ "T3.course_name",
1749
+ ",",
1750
+ "count",
1751
+ "(",
1752
+ "*",
1753
+ ")",
1754
+ "FROM",
1755
+ "students",
1756
+ "AS",
1757
+ "T1",
1758
+ "JOIN",
1759
+ "student_course_registrations",
1760
+ "AS",
1761
+ "T2",
1762
+ "ON",
1763
+ "T1.student_id",
1764
+ "=",
1765
+ "T2.student_id",
1766
+ "JOIN",
1767
+ "courses",
1768
+ "AS",
1769
+ "T3",
1770
+ "ON",
1771
+ "T2.course_id",
1772
+ "=",
1773
+ "T3.course_id",
1774
+ "GROUP",
1775
+ "BY",
1776
+ "T2.course_id"
1777
+ ],
1778
+ "query_toks_no_value": [
1779
+ "select",
1780
+ "t3",
1781
+ ".",
1782
+ "course_name",
1783
+ ",",
1784
+ "count",
1785
+ "(",
1786
+ "*",
1787
+ ")",
1788
+ "from",
1789
+ "students",
1790
+ "as",
1791
+ "t1",
1792
+ "join",
1793
+ "student_course_registrations",
1794
+ "as",
1795
+ "t2",
1796
+ "on",
1797
+ "t1",
1798
+ ".",
1799
+ "student_id",
1800
+ "=",
1801
+ "t2",
1802
+ ".",
1803
+ "student_id",
1804
+ "join",
1805
+ "courses",
1806
+ "as",
1807
+ "t3",
1808
+ "on",
1809
+ "t2",
1810
+ ".",
1811
+ "course_id",
1812
+ "=",
1813
+ "t3",
1814
+ ".",
1815
+ "course_id",
1816
+ "group",
1817
+ "by",
1818
+ "t2",
1819
+ ".",
1820
+ "course_id"
1821
+ ],
1822
+ "question_toks": [
1823
+ "For",
1824
+ "each",
1825
+ "course",
1826
+ "id",
1827
+ ",",
1828
+ "how",
1829
+ "many",
1830
+ "students",
1831
+ "are",
1832
+ "registered",
1833
+ "and",
1834
+ "what",
1835
+ "are",
1836
+ "the",
1837
+ "course",
1838
+ "names",
1839
+ "?"
1840
+ ]
1841
+ },
1842
+ {
1843
+ "db_id": "student_assessment",
1844
+ "query": "SELECT candidate_id FROM candidate_assessments WHERE asessment_outcome_code = \"Pass\"",
1845
+ "question": "Find id of candidates whose assessment code is \"Pass\"?",
1846
+ "query_toks": [
1847
+ "SELECT",
1848
+ "candidate_id",
1849
+ "FROM",
1850
+ "candidate_assessments",
1851
+ "WHERE",
1852
+ "asessment_outcome_code",
1853
+ "=",
1854
+ "``",
1855
+ "Pass",
1856
+ "''"
1857
+ ],
1858
+ "query_toks_no_value": [
1859
+ "select",
1860
+ "candidate_id",
1861
+ "from",
1862
+ "candidate_assessments",
1863
+ "where",
1864
+ "asessment_outcome_code",
1865
+ "=",
1866
+ "value"
1867
+ ],
1868
+ "question_toks": [
1869
+ "Find",
1870
+ "id",
1871
+ "of",
1872
+ "candidates",
1873
+ "whose",
1874
+ "assessment",
1875
+ "code",
1876
+ "is",
1877
+ "``",
1878
+ "Pass",
1879
+ "''",
1880
+ "?"
1881
+ ]
1882
+ },
1883
+ {
1884
+ "db_id": "student_assessment",
1885
+ "query": "SELECT candidate_id FROM candidate_assessments WHERE asessment_outcome_code = \"Pass\"",
1886
+ "question": "What are the ids of the candidates that have an outcome code of Pass?",
1887
+ "query_toks": [
1888
+ "SELECT",
1889
+ "candidate_id",
1890
+ "FROM",
1891
+ "candidate_assessments",
1892
+ "WHERE",
1893
+ "asessment_outcome_code",
1894
+ "=",
1895
+ "``",
1896
+ "Pass",
1897
+ "''"
1898
+ ],
1899
+ "query_toks_no_value": [
1900
+ "select",
1901
+ "candidate_id",
1902
+ "from",
1903
+ "candidate_assessments",
1904
+ "where",
1905
+ "asessment_outcome_code",
1906
+ "=",
1907
+ "value"
1908
+ ],
1909
+ "question_toks": [
1910
+ "What",
1911
+ "are",
1912
+ "the",
1913
+ "ids",
1914
+ "of",
1915
+ "the",
1916
+ "candidates",
1917
+ "that",
1918
+ "have",
1919
+ "an",
1920
+ "outcome",
1921
+ "code",
1922
+ "of",
1923
+ "Pass",
1924
+ "?"
1925
+ ]
1926
+ },
1927
+ {
1928
+ "db_id": "student_assessment",
1929
+ "query": "SELECT T3.cell_mobile_number FROM candidates AS T1 JOIN candidate_assessments AS T2 ON T1.candidate_id = T2.candidate_id JOIN people AS T3 ON T1.candidate_id = T3.person_id WHERE T2.asessment_outcome_code = \"Fail\"",
1930
+ "question": "Find the cell mobile number of the candidates whose assessment code is \"Fail\"?",
1931
+ "query_toks": [
1932
+ "SELECT",
1933
+ "T3.cell_mobile_number",
1934
+ "FROM",
1935
+ "candidates",
1936
+ "AS",
1937
+ "T1",
1938
+ "JOIN",
1939
+ "candidate_assessments",
1940
+ "AS",
1941
+ "T2",
1942
+ "ON",
1943
+ "T1.candidate_id",
1944
+ "=",
1945
+ "T2.candidate_id",
1946
+ "JOIN",
1947
+ "people",
1948
+ "AS",
1949
+ "T3",
1950
+ "ON",
1951
+ "T1.candidate_id",
1952
+ "=",
1953
+ "T3.person_id",
1954
+ "WHERE",
1955
+ "T2.asessment_outcome_code",
1956
+ "=",
1957
+ "``",
1958
+ "Fail",
1959
+ "''"
1960
+ ],
1961
+ "query_toks_no_value": [
1962
+ "select",
1963
+ "t3",
1964
+ ".",
1965
+ "cell_mobile_number",
1966
+ "from",
1967
+ "candidates",
1968
+ "as",
1969
+ "t1",
1970
+ "join",
1971
+ "candidate_assessments",
1972
+ "as",
1973
+ "t2",
1974
+ "on",
1975
+ "t1",
1976
+ ".",
1977
+ "candidate_id",
1978
+ "=",
1979
+ "t2",
1980
+ ".",
1981
+ "candidate_id",
1982
+ "join",
1983
+ "people",
1984
+ "as",
1985
+ "t3",
1986
+ "on",
1987
+ "t1",
1988
+ ".",
1989
+ "candidate_id",
1990
+ "=",
1991
+ "t3",
1992
+ ".",
1993
+ "person_id",
1994
+ "where",
1995
+ "t2",
1996
+ ".",
1997
+ "asessment_outcome_code",
1998
+ "=",
1999
+ "value"
2000
+ ],
2001
+ "question_toks": [
2002
+ "Find",
2003
+ "the",
2004
+ "cell",
2005
+ "mobile",
2006
+ "number",
2007
+ "of",
2008
+ "the",
2009
+ "candidates",
2010
+ "whose",
2011
+ "assessment",
2012
+ "code",
2013
+ "is",
2014
+ "``",
2015
+ "Fail",
2016
+ "''",
2017
+ "?"
2018
+ ]
2019
+ },
2020
+ {
2021
+ "db_id": "student_assessment",
2022
+ "query": "SELECT T3.cell_mobile_number FROM candidates AS T1 JOIN candidate_assessments AS T2 ON T1.candidate_id = T2.candidate_id JOIN people AS T3 ON T1.candidate_id = T3.person_id WHERE T2.asessment_outcome_code = \"Fail\"",
2023
+ "question": "What are the cell phone numbers of the candidates that received an assessment code of \"Fail\"?",
2024
+ "query_toks": [
2025
+ "SELECT",
2026
+ "T3.cell_mobile_number",
2027
+ "FROM",
2028
+ "candidates",
2029
+ "AS",
2030
+ "T1",
2031
+ "JOIN",
2032
+ "candidate_assessments",
2033
+ "AS",
2034
+ "T2",
2035
+ "ON",
2036
+ "T1.candidate_id",
2037
+ "=",
2038
+ "T2.candidate_id",
2039
+ "JOIN",
2040
+ "people",
2041
+ "AS",
2042
+ "T3",
2043
+ "ON",
2044
+ "T1.candidate_id",
2045
+ "=",
2046
+ "T3.person_id",
2047
+ "WHERE",
2048
+ "T2.asessment_outcome_code",
2049
+ "=",
2050
+ "``",
2051
+ "Fail",
2052
+ "''"
2053
+ ],
2054
+ "query_toks_no_value": [
2055
+ "select",
2056
+ "t3",
2057
+ ".",
2058
+ "cell_mobile_number",
2059
+ "from",
2060
+ "candidates",
2061
+ "as",
2062
+ "t1",
2063
+ "join",
2064
+ "candidate_assessments",
2065
+ "as",
2066
+ "t2",
2067
+ "on",
2068
+ "t1",
2069
+ ".",
2070
+ "candidate_id",
2071
+ "=",
2072
+ "t2",
2073
+ ".",
2074
+ "candidate_id",
2075
+ "join",
2076
+ "people",
2077
+ "as",
2078
+ "t3",
2079
+ "on",
2080
+ "t1",
2081
+ ".",
2082
+ "candidate_id",
2083
+ "=",
2084
+ "t3",
2085
+ ".",
2086
+ "person_id",
2087
+ "where",
2088
+ "t2",
2089
+ ".",
2090
+ "asessment_outcome_code",
2091
+ "=",
2092
+ "value"
2093
+ ],
2094
+ "question_toks": [
2095
+ "What",
2096
+ "are",
2097
+ "the",
2098
+ "cell",
2099
+ "phone",
2100
+ "numbers",
2101
+ "of",
2102
+ "the",
2103
+ "candidates",
2104
+ "that",
2105
+ "received",
2106
+ "an",
2107
+ "assessment",
2108
+ "code",
2109
+ "of",
2110
+ "``",
2111
+ "Fail",
2112
+ "''",
2113
+ "?"
2114
+ ]
2115
+ },
2116
+ {
2117
+ "db_id": "student_assessment",
2118
+ "query": "SELECT student_id FROM student_course_attendance WHERE course_id = 301",
2119
+ "question": "What are the id of students who registered course 301?",
2120
+ "query_toks": [
2121
+ "SELECT",
2122
+ "student_id",
2123
+ "FROM",
2124
+ "student_course_attendance",
2125
+ "WHERE",
2126
+ "course_id",
2127
+ "=",
2128
+ "301"
2129
+ ],
2130
+ "query_toks_no_value": [
2131
+ "select",
2132
+ "student_id",
2133
+ "from",
2134
+ "student_course_attendance",
2135
+ "where",
2136
+ "course_id",
2137
+ "=",
2138
+ "value"
2139
+ ],
2140
+ "question_toks": [
2141
+ "What",
2142
+ "are",
2143
+ "the",
2144
+ "id",
2145
+ "of",
2146
+ "students",
2147
+ "who",
2148
+ "registered",
2149
+ "course",
2150
+ "301",
2151
+ "?"
2152
+ ]
2153
+ },
2154
+ {
2155
+ "db_id": "student_assessment",
2156
+ "query": "SELECT student_id FROM student_course_attendance WHERE course_id = 301",
2157
+ "question": "What are the ids of the students who registered for course 301?",
2158
+ "query_toks": [
2159
+ "SELECT",
2160
+ "student_id",
2161
+ "FROM",
2162
+ "student_course_attendance",
2163
+ "WHERE",
2164
+ "course_id",
2165
+ "=",
2166
+ "301"
2167
+ ],
2168
+ "query_toks_no_value": [
2169
+ "select",
2170
+ "student_id",
2171
+ "from",
2172
+ "student_course_attendance",
2173
+ "where",
2174
+ "course_id",
2175
+ "=",
2176
+ "value"
2177
+ ],
2178
+ "question_toks": [
2179
+ "What",
2180
+ "are",
2181
+ "the",
2182
+ "ids",
2183
+ "of",
2184
+ "the",
2185
+ "students",
2186
+ "who",
2187
+ "registered",
2188
+ "for",
2189
+ "course",
2190
+ "301",
2191
+ "?"
2192
+ ]
2193
+ },
2194
+ {
2195
+ "db_id": "student_assessment",
2196
+ "query": "SELECT student_id FROM student_course_attendance WHERE course_id = 301 ORDER BY date_of_attendance DESC LIMIT 1",
2197
+ "question": "What is the id of the student who most recently registered course 301?",
2198
+ "query_toks": [
2199
+ "SELECT",
2200
+ "student_id",
2201
+ "FROM",
2202
+ "student_course_attendance",
2203
+ "WHERE",
2204
+ "course_id",
2205
+ "=",
2206
+ "301",
2207
+ "ORDER",
2208
+ "BY",
2209
+ "date_of_attendance",
2210
+ "DESC",
2211
+ "LIMIT",
2212
+ "1"
2213
+ ],
2214
+ "query_toks_no_value": [
2215
+ "select",
2216
+ "student_id",
2217
+ "from",
2218
+ "student_course_attendance",
2219
+ "where",
2220
+ "course_id",
2221
+ "=",
2222
+ "value",
2223
+ "order",
2224
+ "by",
2225
+ "date_of_attendance",
2226
+ "desc",
2227
+ "limit",
2228
+ "value"
2229
+ ],
2230
+ "question_toks": [
2231
+ "What",
2232
+ "is",
2233
+ "the",
2234
+ "id",
2235
+ "of",
2236
+ "the",
2237
+ "student",
2238
+ "who",
2239
+ "most",
2240
+ "recently",
2241
+ "registered",
2242
+ "course",
2243
+ "301",
2244
+ "?"
2245
+ ]
2246
+ },
2247
+ {
2248
+ "db_id": "student_assessment",
2249
+ "query": "SELECT student_id FROM student_course_attendance WHERE course_id = 301 ORDER BY date_of_attendance DESC LIMIT 1",
2250
+ "question": "What are the ids of the students who registered for course 301 most recently?",
2251
+ "query_toks": [
2252
+ "SELECT",
2253
+ "student_id",
2254
+ "FROM",
2255
+ "student_course_attendance",
2256
+ "WHERE",
2257
+ "course_id",
2258
+ "=",
2259
+ "301",
2260
+ "ORDER",
2261
+ "BY",
2262
+ "date_of_attendance",
2263
+ "DESC",
2264
+ "LIMIT",
2265
+ "1"
2266
+ ],
2267
+ "query_toks_no_value": [
2268
+ "select",
2269
+ "student_id",
2270
+ "from",
2271
+ "student_course_attendance",
2272
+ "where",
2273
+ "course_id",
2274
+ "=",
2275
+ "value",
2276
+ "order",
2277
+ "by",
2278
+ "date_of_attendance",
2279
+ "desc",
2280
+ "limit",
2281
+ "value"
2282
+ ],
2283
+ "question_toks": [
2284
+ "What",
2285
+ "are",
2286
+ "the",
2287
+ "ids",
2288
+ "of",
2289
+ "the",
2290
+ "students",
2291
+ "who",
2292
+ "registered",
2293
+ "for",
2294
+ "course",
2295
+ "301",
2296
+ "most",
2297
+ "recently",
2298
+ "?"
2299
+ ]
2300
+ },
2301
+ {
2302
+ "db_id": "student_assessment",
2303
+ "query": "SELECT DISTINCT T1.city FROM addresses AS T1 JOIN people_addresses AS T2 ON T1.address_id = T2.address_id",
2304
+ "question": "Find distinct cities of addresses of people?",
2305
+ "query_toks": [
2306
+ "SELECT",
2307
+ "DISTINCT",
2308
+ "T1.city",
2309
+ "FROM",
2310
+ "addresses",
2311
+ "AS",
2312
+ "T1",
2313
+ "JOIN",
2314
+ "people_addresses",
2315
+ "AS",
2316
+ "T2",
2317
+ "ON",
2318
+ "T1.address_id",
2319
+ "=",
2320
+ "T2.address_id"
2321
+ ],
2322
+ "query_toks_no_value": [
2323
+ "select",
2324
+ "distinct",
2325
+ "t1",
2326
+ ".",
2327
+ "city",
2328
+ "from",
2329
+ "addresses",
2330
+ "as",
2331
+ "t1",
2332
+ "join",
2333
+ "people_addresses",
2334
+ "as",
2335
+ "t2",
2336
+ "on",
2337
+ "t1",
2338
+ ".",
2339
+ "address_id",
2340
+ "=",
2341
+ "t2",
2342
+ ".",
2343
+ "address_id"
2344
+ ],
2345
+ "question_toks": [
2346
+ "Find",
2347
+ "distinct",
2348
+ "cities",
2349
+ "of",
2350
+ "addresses",
2351
+ "of",
2352
+ "people",
2353
+ "?"
2354
+ ]
2355
+ },
2356
+ {
2357
+ "db_id": "student_assessment",
2358
+ "query": "SELECT DISTINCT T1.city FROM addresses AS T1 JOIN people_addresses AS T2 ON T1.address_id = T2.address_id",
2359
+ "question": "What are the different cities where people live?",
2360
+ "query_toks": [
2361
+ "SELECT",
2362
+ "DISTINCT",
2363
+ "T1.city",
2364
+ "FROM",
2365
+ "addresses",
2366
+ "AS",
2367
+ "T1",
2368
+ "JOIN",
2369
+ "people_addresses",
2370
+ "AS",
2371
+ "T2",
2372
+ "ON",
2373
+ "T1.address_id",
2374
+ "=",
2375
+ "T2.address_id"
2376
+ ],
2377
+ "query_toks_no_value": [
2378
+ "select",
2379
+ "distinct",
2380
+ "t1",
2381
+ ".",
2382
+ "city",
2383
+ "from",
2384
+ "addresses",
2385
+ "as",
2386
+ "t1",
2387
+ "join",
2388
+ "people_addresses",
2389
+ "as",
2390
+ "t2",
2391
+ "on",
2392
+ "t1",
2393
+ ".",
2394
+ "address_id",
2395
+ "=",
2396
+ "t2",
2397
+ ".",
2398
+ "address_id"
2399
+ ],
2400
+ "question_toks": [
2401
+ "What",
2402
+ "are",
2403
+ "the",
2404
+ "different",
2405
+ "cities",
2406
+ "where",
2407
+ "people",
2408
+ "live",
2409
+ "?"
2410
+ ]
2411
+ },
2412
+ {
2413
+ "db_id": "student_assessment",
2414
+ "query": "SELECT DISTINCT T1.city FROM addresses AS T1 JOIN people_addresses AS T2 ON T1.address_id = T2.address_id JOIN students AS T3 ON T2.person_id = T3.student_id",
2415
+ "question": "Find distinct cities of address of students?",
2416
+ "query_toks": [
2417
+ "SELECT",
2418
+ "DISTINCT",
2419
+ "T1.city",
2420
+ "FROM",
2421
+ "addresses",
2422
+ "AS",
2423
+ "T1",
2424
+ "JOIN",
2425
+ "people_addresses",
2426
+ "AS",
2427
+ "T2",
2428
+ "ON",
2429
+ "T1.address_id",
2430
+ "=",
2431
+ "T2.address_id",
2432
+ "JOIN",
2433
+ "students",
2434
+ "AS",
2435
+ "T3",
2436
+ "ON",
2437
+ "T2.person_id",
2438
+ "=",
2439
+ "T3.student_id"
2440
+ ],
2441
+ "query_toks_no_value": [
2442
+ "select",
2443
+ "distinct",
2444
+ "t1",
2445
+ ".",
2446
+ "city",
2447
+ "from",
2448
+ "addresses",
2449
+ "as",
2450
+ "t1",
2451
+ "join",
2452
+ "people_addresses",
2453
+ "as",
2454
+ "t2",
2455
+ "on",
2456
+ "t1",
2457
+ ".",
2458
+ "address_id",
2459
+ "=",
2460
+ "t2",
2461
+ ".",
2462
+ "address_id",
2463
+ "join",
2464
+ "students",
2465
+ "as",
2466
+ "t3",
2467
+ "on",
2468
+ "t2",
2469
+ ".",
2470
+ "person_id",
2471
+ "=",
2472
+ "t3",
2473
+ ".",
2474
+ "student_id"
2475
+ ],
2476
+ "question_toks": [
2477
+ "Find",
2478
+ "distinct",
2479
+ "cities",
2480
+ "of",
2481
+ "address",
2482
+ "of",
2483
+ "students",
2484
+ "?"
2485
+ ]
2486
+ },
2487
+ {
2488
+ "db_id": "student_assessment",
2489
+ "query": "SELECT DISTINCT T1.city FROM addresses AS T1 JOIN people_addresses AS T2 ON T1.address_id = T2.address_id JOIN students AS T3 ON T2.person_id = T3.student_id",
2490
+ "question": "What are the different cities where students live?",
2491
+ "query_toks": [
2492
+ "SELECT",
2493
+ "DISTINCT",
2494
+ "T1.city",
2495
+ "FROM",
2496
+ "addresses",
2497
+ "AS",
2498
+ "T1",
2499
+ "JOIN",
2500
+ "people_addresses",
2501
+ "AS",
2502
+ "T2",
2503
+ "ON",
2504
+ "T1.address_id",
2505
+ "=",
2506
+ "T2.address_id",
2507
+ "JOIN",
2508
+ "students",
2509
+ "AS",
2510
+ "T3",
2511
+ "ON",
2512
+ "T2.person_id",
2513
+ "=",
2514
+ "T3.student_id"
2515
+ ],
2516
+ "query_toks_no_value": [
2517
+ "select",
2518
+ "distinct",
2519
+ "t1",
2520
+ ".",
2521
+ "city",
2522
+ "from",
2523
+ "addresses",
2524
+ "as",
2525
+ "t1",
2526
+ "join",
2527
+ "people_addresses",
2528
+ "as",
2529
+ "t2",
2530
+ "on",
2531
+ "t1",
2532
+ ".",
2533
+ "address_id",
2534
+ "=",
2535
+ "t2",
2536
+ ".",
2537
+ "address_id",
2538
+ "join",
2539
+ "students",
2540
+ "as",
2541
+ "t3",
2542
+ "on",
2543
+ "t2",
2544
+ ".",
2545
+ "person_id",
2546
+ "=",
2547
+ "t3",
2548
+ ".",
2549
+ "student_id"
2550
+ ],
2551
+ "question_toks": [
2552
+ "What",
2553
+ "are",
2554
+ "the",
2555
+ "different",
2556
+ "cities",
2557
+ "where",
2558
+ "students",
2559
+ "live",
2560
+ "?"
2561
+ ]
2562
+ },
2563
+ {
2564
+ "db_id": "student_assessment",
2565
+ "query": "SELECT course_name FROM courses ORDER BY course_name",
2566
+ "question": "List the names of courses in alphabetical order?",
2567
+ "query_toks": [
2568
+ "SELECT",
2569
+ "course_name",
2570
+ "FROM",
2571
+ "courses",
2572
+ "ORDER",
2573
+ "BY",
2574
+ "course_name"
2575
+ ],
2576
+ "query_toks_no_value": [
2577
+ "select",
2578
+ "course_name",
2579
+ "from",
2580
+ "courses",
2581
+ "order",
2582
+ "by",
2583
+ "course_name"
2584
+ ],
2585
+ "question_toks": [
2586
+ "List",
2587
+ "the",
2588
+ "names",
2589
+ "of",
2590
+ "courses",
2591
+ "in",
2592
+ "alphabetical",
2593
+ "order",
2594
+ "?"
2595
+ ]
2596
+ },
2597
+ {
2598
+ "db_id": "student_assessment",
2599
+ "query": "SELECT course_name FROM courses ORDER BY course_name",
2600
+ "question": "What are the names of the courses in alphabetical order?",
2601
+ "query_toks": [
2602
+ "SELECT",
2603
+ "course_name",
2604
+ "FROM",
2605
+ "courses",
2606
+ "ORDER",
2607
+ "BY",
2608
+ "course_name"
2609
+ ],
2610
+ "query_toks_no_value": [
2611
+ "select",
2612
+ "course_name",
2613
+ "from",
2614
+ "courses",
2615
+ "order",
2616
+ "by",
2617
+ "course_name"
2618
+ ],
2619
+ "question_toks": [
2620
+ "What",
2621
+ "are",
2622
+ "the",
2623
+ "names",
2624
+ "of",
2625
+ "the",
2626
+ "courses",
2627
+ "in",
2628
+ "alphabetical",
2629
+ "order",
2630
+ "?"
2631
+ ]
2632
+ },
2633
+ {
2634
+ "db_id": "student_assessment",
2635
+ "query": "SELECT first_name FROM people ORDER BY first_name",
2636
+ "question": "List the first names of people in alphabetical order?",
2637
+ "query_toks": [
2638
+ "SELECT",
2639
+ "first_name",
2640
+ "FROM",
2641
+ "people",
2642
+ "ORDER",
2643
+ "BY",
2644
+ "first_name"
2645
+ ],
2646
+ "query_toks_no_value": [
2647
+ "select",
2648
+ "first_name",
2649
+ "from",
2650
+ "people",
2651
+ "order",
2652
+ "by",
2653
+ "first_name"
2654
+ ],
2655
+ "question_toks": [
2656
+ "List",
2657
+ "the",
2658
+ "first",
2659
+ "names",
2660
+ "of",
2661
+ "people",
2662
+ "in",
2663
+ "alphabetical",
2664
+ "order",
2665
+ "?"
2666
+ ]
2667
+ },
2668
+ {
2669
+ "db_id": "student_assessment",
2670
+ "query": "SELECT first_name FROM people ORDER BY first_name",
2671
+ "question": "What are the first names of the people in alphabetical order?",
2672
+ "query_toks": [
2673
+ "SELECT",
2674
+ "first_name",
2675
+ "FROM",
2676
+ "people",
2677
+ "ORDER",
2678
+ "BY",
2679
+ "first_name"
2680
+ ],
2681
+ "query_toks_no_value": [
2682
+ "select",
2683
+ "first_name",
2684
+ "from",
2685
+ "people",
2686
+ "order",
2687
+ "by",
2688
+ "first_name"
2689
+ ],
2690
+ "question_toks": [
2691
+ "What",
2692
+ "are",
2693
+ "the",
2694
+ "first",
2695
+ "names",
2696
+ "of",
2697
+ "the",
2698
+ "people",
2699
+ "in",
2700
+ "alphabetical",
2701
+ "order",
2702
+ "?"
2703
+ ]
2704
+ },
2705
+ {
2706
+ "db_id": "student_assessment",
2707
+ "query": "SELECT student_id FROM student_course_registrations UNION SELECT student_id FROM student_course_attendance",
2708
+ "question": "What are the id of students who registered courses or attended courses?",
2709
+ "query_toks": [
2710
+ "SELECT",
2711
+ "student_id",
2712
+ "FROM",
2713
+ "student_course_registrations",
2714
+ "UNION",
2715
+ "SELECT",
2716
+ "student_id",
2717
+ "FROM",
2718
+ "student_course_attendance"
2719
+ ],
2720
+ "query_toks_no_value": [
2721
+ "select",
2722
+ "student_id",
2723
+ "from",
2724
+ "student_course_registrations",
2725
+ "union",
2726
+ "select",
2727
+ "student_id",
2728
+ "from",
2729
+ "student_course_attendance"
2730
+ ],
2731
+ "question_toks": [
2732
+ "What",
2733
+ "are",
2734
+ "the",
2735
+ "id",
2736
+ "of",
2737
+ "students",
2738
+ "who",
2739
+ "registered",
2740
+ "courses",
2741
+ "or",
2742
+ "attended",
2743
+ "courses",
2744
+ "?"
2745
+ ]
2746
+ },
2747
+ {
2748
+ "db_id": "student_assessment",
2749
+ "query": "SELECT student_id FROM student_course_registrations UNION SELECT student_id FROM student_course_attendance",
2750
+ "question": "What are the ids of the students who either registered or attended a course?",
2751
+ "query_toks": [
2752
+ "SELECT",
2753
+ "student_id",
2754
+ "FROM",
2755
+ "student_course_registrations",
2756
+ "UNION",
2757
+ "SELECT",
2758
+ "student_id",
2759
+ "FROM",
2760
+ "student_course_attendance"
2761
+ ],
2762
+ "query_toks_no_value": [
2763
+ "select",
2764
+ "student_id",
2765
+ "from",
2766
+ "student_course_registrations",
2767
+ "union",
2768
+ "select",
2769
+ "student_id",
2770
+ "from",
2771
+ "student_course_attendance"
2772
+ ],
2773
+ "question_toks": [
2774
+ "What",
2775
+ "are",
2776
+ "the",
2777
+ "ids",
2778
+ "of",
2779
+ "the",
2780
+ "students",
2781
+ "who",
2782
+ "either",
2783
+ "registered",
2784
+ "or",
2785
+ "attended",
2786
+ "a",
2787
+ "course",
2788
+ "?"
2789
+ ]
2790
+ },
2791
+ {
2792
+ "db_id": "student_assessment",
2793
+ "query": "SELECT course_id FROM student_course_registrations WHERE student_id = 121 UNION SELECT course_id FROM student_course_attendance WHERE student_id = 121",
2794
+ "question": "Find the id of courses which are registered or attended by student whose id is 121?",
2795
+ "query_toks": [
2796
+ "SELECT",
2797
+ "course_id",
2798
+ "FROM",
2799
+ "student_course_registrations",
2800
+ "WHERE",
2801
+ "student_id",
2802
+ "=",
2803
+ "121",
2804
+ "UNION",
2805
+ "SELECT",
2806
+ "course_id",
2807
+ "FROM",
2808
+ "student_course_attendance",
2809
+ "WHERE",
2810
+ "student_id",
2811
+ "=",
2812
+ "121"
2813
+ ],
2814
+ "query_toks_no_value": [
2815
+ "select",
2816
+ "course_id",
2817
+ "from",
2818
+ "student_course_registrations",
2819
+ "where",
2820
+ "student_id",
2821
+ "=",
2822
+ "value",
2823
+ "union",
2824
+ "select",
2825
+ "course_id",
2826
+ "from",
2827
+ "student_course_attendance",
2828
+ "where",
2829
+ "student_id",
2830
+ "=",
2831
+ "value"
2832
+ ],
2833
+ "question_toks": [
2834
+ "Find",
2835
+ "the",
2836
+ "id",
2837
+ "of",
2838
+ "courses",
2839
+ "which",
2840
+ "are",
2841
+ "registered",
2842
+ "or",
2843
+ "attended",
2844
+ "by",
2845
+ "student",
2846
+ "whose",
2847
+ "id",
2848
+ "is",
2849
+ "121",
2850
+ "?"
2851
+ ]
2852
+ },
2853
+ {
2854
+ "db_id": "student_assessment",
2855
+ "query": "SELECT course_id FROM student_course_registrations WHERE student_id = 121 UNION SELECT course_id FROM student_course_attendance WHERE student_id = 121",
2856
+ "question": "What are the ids of the courses that are registered or attended by the student whose id is 121?",
2857
+ "query_toks": [
2858
+ "SELECT",
2859
+ "course_id",
2860
+ "FROM",
2861
+ "student_course_registrations",
2862
+ "WHERE",
2863
+ "student_id",
2864
+ "=",
2865
+ "121",
2866
+ "UNION",
2867
+ "SELECT",
2868
+ "course_id",
2869
+ "FROM",
2870
+ "student_course_attendance",
2871
+ "WHERE",
2872
+ "student_id",
2873
+ "=",
2874
+ "121"
2875
+ ],
2876
+ "query_toks_no_value": [
2877
+ "select",
2878
+ "course_id",
2879
+ "from",
2880
+ "student_course_registrations",
2881
+ "where",
2882
+ "student_id",
2883
+ "=",
2884
+ "value",
2885
+ "union",
2886
+ "select",
2887
+ "course_id",
2888
+ "from",
2889
+ "student_course_attendance",
2890
+ "where",
2891
+ "student_id",
2892
+ "=",
2893
+ "value"
2894
+ ],
2895
+ "question_toks": [
2896
+ "What",
2897
+ "are",
2898
+ "the",
2899
+ "ids",
2900
+ "of",
2901
+ "the",
2902
+ "courses",
2903
+ "that",
2904
+ "are",
2905
+ "registered",
2906
+ "or",
2907
+ "attended",
2908
+ "by",
2909
+ "the",
2910
+ "student",
2911
+ "whose",
2912
+ "id",
2913
+ "is",
2914
+ "121",
2915
+ "?"
2916
+ ]
2917
+ },
2918
+ {
2919
+ "db_id": "student_assessment",
2920
+ "query": "SELECT * FROM student_course_registrations WHERE student_id NOT IN (SELECT student_id FROM student_course_attendance)",
2921
+ "question": "What are all info of students who registered courses but not attended courses?",
2922
+ "query_toks": [
2923
+ "SELECT",
2924
+ "*",
2925
+ "FROM",
2926
+ "student_course_registrations",
2927
+ "WHERE",
2928
+ "student_id",
2929
+ "NOT",
2930
+ "IN",
2931
+ "(",
2932
+ "SELECT",
2933
+ "student_id",
2934
+ "FROM",
2935
+ "student_course_attendance",
2936
+ ")"
2937
+ ],
2938
+ "query_toks_no_value": [
2939
+ "select",
2940
+ "*",
2941
+ "from",
2942
+ "student_course_registrations",
2943
+ "where",
2944
+ "student_id",
2945
+ "not",
2946
+ "in",
2947
+ "(",
2948
+ "select",
2949
+ "student_id",
2950
+ "from",
2951
+ "student_course_attendance",
2952
+ ")"
2953
+ ],
2954
+ "question_toks": [
2955
+ "What",
2956
+ "are",
2957
+ "all",
2958
+ "info",
2959
+ "of",
2960
+ "students",
2961
+ "who",
2962
+ "registered",
2963
+ "courses",
2964
+ "but",
2965
+ "not",
2966
+ "attended",
2967
+ "courses",
2968
+ "?"
2969
+ ]
2970
+ },
2971
+ {
2972
+ "db_id": "student_assessment",
2973
+ "query": "SELECT * FROM student_course_registrations WHERE student_id NOT IN (SELECT student_id FROM student_course_attendance)",
2974
+ "question": "What are all details of the students who registered but did not attend any course?",
2975
+ "query_toks": [
2976
+ "SELECT",
2977
+ "*",
2978
+ "FROM",
2979
+ "student_course_registrations",
2980
+ "WHERE",
2981
+ "student_id",
2982
+ "NOT",
2983
+ "IN",
2984
+ "(",
2985
+ "SELECT",
2986
+ "student_id",
2987
+ "FROM",
2988
+ "student_course_attendance",
2989
+ ")"
2990
+ ],
2991
+ "query_toks_no_value": [
2992
+ "select",
2993
+ "*",
2994
+ "from",
2995
+ "student_course_registrations",
2996
+ "where",
2997
+ "student_id",
2998
+ "not",
2999
+ "in",
3000
+ "(",
3001
+ "select",
3002
+ "student_id",
3003
+ "from",
3004
+ "student_course_attendance",
3005
+ ")"
3006
+ ],
3007
+ "question_toks": [
3008
+ "What",
3009
+ "are",
3010
+ "all",
3011
+ "details",
3012
+ "of",
3013
+ "the",
3014
+ "students",
3015
+ "who",
3016
+ "registered",
3017
+ "but",
3018
+ "did",
3019
+ "not",
3020
+ "attend",
3021
+ "any",
3022
+ "course",
3023
+ "?"
3024
+ ]
3025
+ },
3026
+ {
3027
+ "db_id": "student_assessment",
3028
+ "query": "SELECT T2.student_id FROM courses AS T1 JOIN student_course_registrations AS T2 ON T1.course_id = T2.course_id WHERE T1.course_name = \"statistics\" ORDER BY T2.registration_date",
3029
+ "question": "List the id of students who registered course statistics in the order of registration date.",
3030
+ "query_toks": [
3031
+ "SELECT",
3032
+ "T2.student_id",
3033
+ "FROM",
3034
+ "courses",
3035
+ "AS",
3036
+ "T1",
3037
+ "JOIN",
3038
+ "student_course_registrations",
3039
+ "AS",
3040
+ "T2",
3041
+ "ON",
3042
+ "T1.course_id",
3043
+ "=",
3044
+ "T2.course_id",
3045
+ "WHERE",
3046
+ "T1.course_name",
3047
+ "=",
3048
+ "``",
3049
+ "statistics",
3050
+ "''",
3051
+ "ORDER",
3052
+ "BY",
3053
+ "T2.registration_date"
3054
+ ],
3055
+ "query_toks_no_value": [
3056
+ "select",
3057
+ "t2",
3058
+ ".",
3059
+ "student_id",
3060
+ "from",
3061
+ "courses",
3062
+ "as",
3063
+ "t1",
3064
+ "join",
3065
+ "student_course_registrations",
3066
+ "as",
3067
+ "t2",
3068
+ "on",
3069
+ "t1",
3070
+ ".",
3071
+ "course_id",
3072
+ "=",
3073
+ "t2",
3074
+ ".",
3075
+ "course_id",
3076
+ "where",
3077
+ "t1",
3078
+ ".",
3079
+ "course_name",
3080
+ "=",
3081
+ "value",
3082
+ "order",
3083
+ "by",
3084
+ "t2",
3085
+ ".",
3086
+ "registration_date"
3087
+ ],
3088
+ "question_toks": [
3089
+ "List",
3090
+ "the",
3091
+ "id",
3092
+ "of",
3093
+ "students",
3094
+ "who",
3095
+ "registered",
3096
+ "course",
3097
+ "statistics",
3098
+ "in",
3099
+ "the",
3100
+ "order",
3101
+ "of",
3102
+ "registration",
3103
+ "date",
3104
+ "."
3105
+ ]
3106
+ },
3107
+ {
3108
+ "db_id": "student_assessment",
3109
+ "query": "SELECT T2.student_id FROM courses AS T1 JOIN student_course_registrations AS T2 ON T1.course_id = T2.course_id WHERE T1.course_name = \"statistics\" ORDER BY T2.registration_date",
3110
+ "question": "What are the ids of the students who registered course statistics by order of registration date?",
3111
+ "query_toks": [
3112
+ "SELECT",
3113
+ "T2.student_id",
3114
+ "FROM",
3115
+ "courses",
3116
+ "AS",
3117
+ "T1",
3118
+ "JOIN",
3119
+ "student_course_registrations",
3120
+ "AS",
3121
+ "T2",
3122
+ "ON",
3123
+ "T1.course_id",
3124
+ "=",
3125
+ "T2.course_id",
3126
+ "WHERE",
3127
+ "T1.course_name",
3128
+ "=",
3129
+ "``",
3130
+ "statistics",
3131
+ "''",
3132
+ "ORDER",
3133
+ "BY",
3134
+ "T2.registration_date"
3135
+ ],
3136
+ "query_toks_no_value": [
3137
+ "select",
3138
+ "t2",
3139
+ ".",
3140
+ "student_id",
3141
+ "from",
3142
+ "courses",
3143
+ "as",
3144
+ "t1",
3145
+ "join",
3146
+ "student_course_registrations",
3147
+ "as",
3148
+ "t2",
3149
+ "on",
3150
+ "t1",
3151
+ ".",
3152
+ "course_id",
3153
+ "=",
3154
+ "t2",
3155
+ ".",
3156
+ "course_id",
3157
+ "where",
3158
+ "t1",
3159
+ ".",
3160
+ "course_name",
3161
+ "=",
3162
+ "value",
3163
+ "order",
3164
+ "by",
3165
+ "t2",
3166
+ ".",
3167
+ "registration_date"
3168
+ ],
3169
+ "question_toks": [
3170
+ "What",
3171
+ "are",
3172
+ "the",
3173
+ "ids",
3174
+ "of",
3175
+ "the",
3176
+ "students",
3177
+ "who",
3178
+ "registered",
3179
+ "course",
3180
+ "statistics",
3181
+ "by",
3182
+ "order",
3183
+ "of",
3184
+ "registration",
3185
+ "date",
3186
+ "?"
3187
+ ]
3188
+ },
3189
+ {
3190
+ "db_id": "student_assessment",
3191
+ "query": "SELECT T2.student_id FROM courses AS T1 JOIN student_course_attendance AS T2 ON T1.course_id = T2.course_id WHERE T1.course_name = \"statistics\" ORDER BY T2.date_of_attendance",
3192
+ "question": "List the id of students who attended statistics courses in the order of attendance date.",
3193
+ "query_toks": [
3194
+ "SELECT",
3195
+ "T2.student_id",
3196
+ "FROM",
3197
+ "courses",
3198
+ "AS",
3199
+ "T1",
3200
+ "JOIN",
3201
+ "student_course_attendance",
3202
+ "AS",
3203
+ "T2",
3204
+ "ON",
3205
+ "T1.course_id",
3206
+ "=",
3207
+ "T2.course_id",
3208
+ "WHERE",
3209
+ "T1.course_name",
3210
+ "=",
3211
+ "``",
3212
+ "statistics",
3213
+ "''",
3214
+ "ORDER",
3215
+ "BY",
3216
+ "T2.date_of_attendance"
3217
+ ],
3218
+ "query_toks_no_value": [
3219
+ "select",
3220
+ "t2",
3221
+ ".",
3222
+ "student_id",
3223
+ "from",
3224
+ "courses",
3225
+ "as",
3226
+ "t1",
3227
+ "join",
3228
+ "student_course_attendance",
3229
+ "as",
3230
+ "t2",
3231
+ "on",
3232
+ "t1",
3233
+ ".",
3234
+ "course_id",
3235
+ "=",
3236
+ "t2",
3237
+ ".",
3238
+ "course_id",
3239
+ "where",
3240
+ "t1",
3241
+ ".",
3242
+ "course_name",
3243
+ "=",
3244
+ "value",
3245
+ "order",
3246
+ "by",
3247
+ "t2",
3248
+ ".",
3249
+ "date_of_attendance"
3250
+ ],
3251
+ "question_toks": [
3252
+ "List",
3253
+ "the",
3254
+ "id",
3255
+ "of",
3256
+ "students",
3257
+ "who",
3258
+ "attended",
3259
+ "statistics",
3260
+ "courses",
3261
+ "in",
3262
+ "the",
3263
+ "order",
3264
+ "of",
3265
+ "attendance",
3266
+ "date",
3267
+ "."
3268
+ ]
3269
+ },
3270
+ {
3271
+ "db_id": "student_assessment",
3272
+ "query": "SELECT T2.student_id FROM courses AS T1 JOIN student_course_attendance AS T2 ON T1.course_id = T2.course_id WHERE T1.course_name = \"statistics\" ORDER BY T2.date_of_attendance",
3273
+ "question": "What are the ids of the students who attended courses in the statistics department in order of attendance date.",
3274
+ "query_toks": [
3275
+ "SELECT",
3276
+ "T2.student_id",
3277
+ "FROM",
3278
+ "courses",
3279
+ "AS",
3280
+ "T1",
3281
+ "JOIN",
3282
+ "student_course_attendance",
3283
+ "AS",
3284
+ "T2",
3285
+ "ON",
3286
+ "T1.course_id",
3287
+ "=",
3288
+ "T2.course_id",
3289
+ "WHERE",
3290
+ "T1.course_name",
3291
+ "=",
3292
+ "``",
3293
+ "statistics",
3294
+ "''",
3295
+ "ORDER",
3296
+ "BY",
3297
+ "T2.date_of_attendance"
3298
+ ],
3299
+ "query_toks_no_value": [
3300
+ "select",
3301
+ "t2",
3302
+ ".",
3303
+ "student_id",
3304
+ "from",
3305
+ "courses",
3306
+ "as",
3307
+ "t1",
3308
+ "join",
3309
+ "student_course_attendance",
3310
+ "as",
3311
+ "t2",
3312
+ "on",
3313
+ "t1",
3314
+ ".",
3315
+ "course_id",
3316
+ "=",
3317
+ "t2",
3318
+ ".",
3319
+ "course_id",
3320
+ "where",
3321
+ "t1",
3322
+ ".",
3323
+ "course_name",
3324
+ "=",
3325
+ "value",
3326
+ "order",
3327
+ "by",
3328
+ "t2",
3329
+ ".",
3330
+ "date_of_attendance"
3331
+ ],
3332
+ "question_toks": [
3333
+ "What",
3334
+ "are",
3335
+ "the",
3336
+ "ids",
3337
+ "of",
3338
+ "the",
3339
+ "students",
3340
+ "who",
3341
+ "attended",
3342
+ "courses",
3343
+ "in",
3344
+ "the",
3345
+ "statistics",
3346
+ "department",
3347
+ "in",
3348
+ "order",
3349
+ "of",
3350
+ "attendance",
3351
+ "date",
3352
+ "."
3353
+ ]
3354
+ }
3355
+ ]
docs/ARCHITECTURE.md ADDED
@@ -0,0 +1,361 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Architecture
2
+
3
+ > Last updated: 2026-02-28
4
+
5
+ System map for SQLEnv — an RL environment where agents learn interactive SQL exploration via the OpenEnv framework.
6
+
7
+ **Goals:**
8
+ - Show how components connect (system map + key flows)
9
+ - Make hidden state explicit (what lives where)
10
+ - Define shared interfaces (Pydantic models, WebSocket API)
11
+ - Keep invariants legible (what must stay true)
12
+
13
+ **Non-goals:**
14
+ - CLI reference (see `docs/RUNBOOK.md`)
15
+ - Per-feature implementation details (link to specs)
16
+
17
+ ---
18
+
19
+ ## System Map
20
+
21
+ ```text
22
+ SQLEnv System
23
+ ================================================================
24
+
25
+ RL Training Loop SQLEnv Server (Docker)
26
+ ---------------- ----------------------
27
+ +---------------------+
28
+ +------------+ WebSocket (JSON) | server/app.py |
29
+ | SQLEnv |<=========================>| FastAPI + WS |
30
+ | Client | SQLAction -> server | |
31
+ | (client.py)| SQLObs <- server +----------+----------+
32
+ +-----+------+ |
33
+ | v
34
+ | tensor <-> list +---------------------+
35
+ | serialization | SQLEnvironment |
36
+ | | (sql_environment.py)|
37
+ +-----v------+ | |
38
+ | RL Agent | | - reset() / step() |
39
+ | (external) | | - action detection |
40
+ | e.g. GRPO | | - message_to_action |
41
+ +------------+ +--+-------+-------+--+
42
+ | | |
43
+ v v v
44
+ +------+ +------+ +--------+
45
+ |Schema| |Sample| | Query |
46
+ |Intro-| |Gen | | (Ollama|
47
+ |spect.| | | | LLM) |
48
+ +--+---+ +--+---+ +---+----+
49
+ | | |
50
+ v v v
51
+ +-------------------------+
52
+ | SQLAlchemy ORM Models |
53
+ | (data/databases/ |
54
+ | models.py) |
55
+ | 9 tables: |
56
+ | Address, Person, |
57
+ | Student, Course, ... |
58
+ +-------------------------+
59
+
60
+ Data (committed) External (optional)
61
+ ---------------- -------------------
62
+ data/questions/ +----------+
63
+ student_assessment.json | Ollama |
64
+ (53 Spider Q&A pairs) | LLM API |
65
+ | :11434 |
66
+ +----------+
67
+ ```
68
+
69
+ ---
70
+
71
+ ## Component Inventory
72
+
73
+ | Component | Owns | Entrypoint | State / Output |
74
+ |-----------|------|------------|----------------|
75
+ | **SQLEnvClient** | WebSocket transport, tensor serialization | `client.py` | Stateless (wraps server) |
76
+ | **FastAPI app** | HTTP/WS endpoints, tokenizer factory | `server/app.py` | In-memory tokenizer |
77
+ | **SQLEnvironment** | Episode lifecycle, action dispatch, state | `server/sql_environment.py` | `SQLState` (in-memory) |
78
+ | **Pydantic models** | Type contracts (action, observation, state) | `models.py` | N/A (data classes) |
79
+ | **ORM models** | Database schema definition | `data/databases/models.py` | SQLAlchemy metadata |
80
+ | **Spider data** | Question-answer pairs | `data/questions/student_assessment.json` | 53 Q&A entries |
81
+ | **MockTokenizer** | Dev/test tokenization (no GPU needed) | `server/test_sql_env.py` | Deterministic (ord/chr) |
82
+
83
+ ### External Services
84
+
85
+ | Service | Purpose | Required | Fallback |
86
+ |---------|---------|----------|----------|
87
+ | Ollama (`localhost:11434`) | Table selection + SQL generation | No | First table in dict; query returns error string |
88
+
89
+ ---
90
+
91
+ ## Key Flows
92
+
93
+ ### Flow: Episode (Reset + Multi-Turn Steps)
94
+
95
+ ```text
96
+ Client Server (SQLEnvironment) Ollama
97
+ | | |
98
+ |--- reset() ----------------->| |
99
+ | |-- init state, system prompt |
100
+ | |-- tokenize system message |
101
+ |<-- SQLObservation -----------| (MockTokenizer or HF) |
102
+ | .messages=[system] | |
103
+ | .tokens=shape([N]) | |
104
+ | | |
105
+ |--- message_to_action(msg) -->| |
106
+ | |-- detect action type |
107
+ | | (keyword matching) |
108
+ | |-- append msg to history |
109
+ | |-- tokenize full conversation |
110
+ |<-- SQLAction ----------------| |
111
+ | .action_type="describe" | |
112
+ | .tokens=shape([1,M]) | |
113
+ | | |
114
+ |--- step(action) ------------>| |
115
+ | |-- select table -------------->|
116
+ | |<-- table name (or fallback) --|
117
+ | |-- introspect ORM schema |
118
+ | |-- append assistant msg |
119
+ | |-- append action tokens |
120
+ |<-- SQLObservation -----------| |
121
+ | .messages=[sys,usr,asst] | |
122
+ | .tokens=shape([N+M+K]) | |
123
+ | | |
124
+ (repeat step() for sample, query, answer...)
125
+ ```
126
+
127
+ ### Flow: Action Detection
128
+
129
+ ```text
130
+ User message string
131
+ |
132
+ v
133
+ _detect_action_type(content)
134
+ |
135
+ +-- contains "describe"/"schema"/"columns"? --> "describe"
136
+ |
137
+ +-- contains "sample"/"example"/"rows"? --> "sample"
138
+ |
139
+ +-- default --> "query"
140
+ ```
141
+
142
+ ### Flow: Client Serialization (WebSocket Transport)
143
+
144
+ ```text
145
+ Client Server
146
+ | |
147
+ | _step_payload(action): |
148
+ | tokens: Tensor -> list (JSON-safe) |
149
+ | {action_type, action_description, |
150
+ | tokens: [[1,2,3,...]], metadata} |
151
+ | ---------------------------------------->|
152
+ | |
153
+ | _parse_result(data): |
154
+ | tokens: list -> Tensor |
155
+ | StepResult(obs, reward, done, info) |
156
+ | <----------------------------------------|
157
+ ```
158
+
159
+ ---
160
+
161
+ ## Shared Data Models
162
+
163
+ These three Pydantic models are used across client, server, and tests.
164
+ Defined in `models.py`.
165
+
166
+ ### SQLAction
167
+
168
+ ```python
169
+ class SQLAction(Action):
170
+ action_type: str # "describe" | "sample" | "query" | "answer"
171
+ action_description: str # raw user message content
172
+ tokens: torch.Tensor # tokenized conversation context, shape [1, seq_len]
173
+ ```
174
+
175
+ **Used by:** SQLEnvironment.step(), SQLEnvClient._step_payload(), tests
176
+
177
+ ### SQLObservation
178
+
179
+ ```python
180
+ class SQLObservation(Observation):
181
+ messages: list[Message] # full conversation history [{role, content}, ...]
182
+ tokens: torch.Tensor # flattened 1D tensor of all turn tokens concatenated
183
+ ```
184
+
185
+ **Used by:** SQLEnvironment.reset()/step(), SQLEnvClient._parse_result(), tests
186
+
187
+ ### SQLState
188
+
189
+ ```python
190
+ class SQLState(State):
191
+ episode_id: str # UUID per episode
192
+ step_count: int # turns taken
193
+ history_messages: list[Message] # accumulates across turns
194
+ history_tokens: list[torch.Tensor] # one tensor per turn, flattened on output
195
+ current_action_type: str | None # last detected action type
196
+ ```
197
+
198
+ **Used by:** SQLEnvironment (internal), state endpoint
199
+ **Note:** This is a lightweight summary for logging. The full RL state lives inside SQLEnvironment and is not exposed to the agent.
200
+
201
+ ---
202
+
203
+ ## API Contracts
204
+
205
+ ### WebSocket (OpenEnv Protocol)
206
+
207
+ The server exposes a WebSocket endpoint via FastAPI. The OpenEnv framework handles the protocol — SQLEnv implements `reset()` and `step()` on the server side, and `SQLEnvClient` wraps the client side.
208
+
209
+ | Operation | Client Method | Payload | Response |
210
+ |-----------|---------------|---------|----------|
211
+ | Reset | `client.reset()` | `{}` | `SQLObservation` (JSON) |
212
+ | Step | `client.step(action)` | `{action_type, action_description, tokens: list, metadata}` | `StepResult(obs, reward, done, info)` |
213
+ | State | `client.state()` | `{}` | `SQLState` (JSON) |
214
+
215
+ ### Ollama (Optional)
216
+
217
+ | Endpoint | Purpose | Payload |
218
+ |----------|---------|---------|
219
+ | `POST /api/generate` | Table selection | `{model, prompt, stream: false}` |
220
+ | `POST /api/generate` | SQL generation | `{model, prompt, stream: false}` |
221
+
222
+ Timeout: 30s. Failure mode: graceful fallback (never crashes).
223
+
224
+ ---
225
+
226
+ ## Cross-Cutting Concerns
227
+
228
+ ### Code Style & Abstraction Philosophy
229
+
230
+ OOP for framework integration (Environment, EnvClient subclasses), plain methods for logic. Extract helpers when they clarify intent, not for DRY.
231
+
232
+ - **Structure:** Flat package root with `server/` for server-only code
233
+ - **Error handling:** Graceful fallbacks (never crash), `ValueError` for invalid inputs
234
+ - **Imports:** `try: from sql_env.X / except: from X` for dual install/Docker compatibility
235
+
236
+ ### Tokenization
237
+
238
+ Two paths, same interface (`apply_chat_template`):
239
+
240
+ | Mode | Tokenizer | Source | When |
241
+ |------|-----------|--------|------|
242
+ | Dev/Test | `MockTokenizer` | `server/test_sql_env.py` | No GPU, no downloads |
243
+ | Production | HuggingFace | `transformers` library | Real RL training |
244
+
245
+ `MockTokenizer` encodes as `ord(c)` per character, decodes as `chr(t)`. Deterministic and fast.
246
+
247
+ ### Configuration
248
+
249
+ | Variable | Required | Description | Default |
250
+ |----------|----------|-------------|---------|
251
+ | `OLLAMA_MODEL` | No | Ollama model name for SQL generation | `qwen2` |
252
+ | `OLLAMA_BASE_URL` | No | Ollama API endpoint | `http://localhost:11434` |
253
+
254
+ ---
255
+
256
+ ## Data, State, and Storage Locations
257
+
258
+ - **Repo (committed):**
259
+ - `data/questions/student_assessment.json` — 53 Spider Q&A pairs
260
+ - `data/databases/models.py` — 9 SQLAlchemy ORM table definitions
261
+ - **Runtime state (in-memory, per episode):**
262
+ - `SQLState.history_messages` — conversation messages
263
+ - `SQLState.history_tokens` — tensor per turn
264
+ - **Not yet implemented:**
265
+ - SQLite database files (Phase 3 — queries currently go through Ollama, not executed locally)
266
+ - Reward/verification state
267
+
268
+ ---
269
+
270
+ ## Invariants and Guardrails
271
+
272
+ - `self.db_models` refers to **database table** models (SQLAlchemy), never RL models
273
+ - Token tensors grow monotonically across turns (never shrink or reset mid-episode)
274
+ - `message_to_action()` mutates state — it appends to history before tokenizing
275
+ - Ollama failures never crash the environment — always graceful fallback
276
+ - `tests/test_smoke.py` must pass without Ollama, without GPU, without network
277
+ - Schema column names in `_build_schema_description()` must match `data/databases/models.py`
278
+
279
+ ---
280
+
281
+ ## Glossary
282
+
283
+ | Term | Definition |
284
+ |------|------------|
285
+ | Episode | One question-answering session: reset -> N steps -> terminal |
286
+ | Action type | One of: describe, sample, query, answer |
287
+ | MockTokenizer | Deterministic char-code tokenizer for dev/test (no GPU) |
288
+ | Spider | Academic text-to-SQL benchmark dataset |
289
+ | ORM models | SQLAlchemy class definitions in `data/databases/models.py` |
290
+ | OpenEnv | Meta's RL environment framework (Environment, EnvClient, Action, Observation) |
291
+
292
+ ---
293
+
294
+ ## Infrastructure
295
+
296
+ ### Development
297
+
298
+ **Prerequisites:**
299
+ - Python 3.11-3.12 (torch incompatible with 3.13)
300
+ - `uv` package manager
301
+ - Ollama (optional)
302
+
303
+ **Setup:**
304
+ ```bash
305
+ git clone <repo-url> && cd sql-env
306
+ uv sync
307
+ uv run pytest tests/ -v # 21 tests, ~3.5s, no external deps
308
+ ```
309
+
310
+ ### Production
311
+
312
+ **Deployment:** Docker container via OpenEnv CLI (`openenv build` / `openenv push`)
313
+ **Runtime:** FastAPI on port 8000 (defined in `openenv.yaml`)
314
+ **Status:** Dockerfile is a scaffold stub — not yet validated
315
+
316
+ ---
317
+
318
+ ## Suggested Feature Breakdown
319
+
320
+ | ID | Feature | Complexity | Dependencies | Notes |
321
+ |----|---------|------------|--------------|-------|
322
+ | F001 | SQL query execution | standard | - | Execute queries against real SQLite, return results |
323
+ | F002 | Reward computation | standard | F001 | 3-layer reward: operational, progress, terminal |
324
+ | F003 | Answer verification | standard | F001 | Compare agent answer to gold SQL results |
325
+ | F004 | Docker validation | simple | - | Update Dockerfile, test `openenv build` |
326
+ | F005 | Multi-database support | complex | F001 | Load any Spider database, not just student_assessment |
327
+
328
+ ### Suggested Implementation Order
329
+
330
+ 1. **F001** — Foundation: wire up SQLite execution so queries return real data
331
+ 2. **F002 + F003** — Can be done in parallel once F001 is complete
332
+ 3. **F004** — Independent, can be done anytime
333
+ 4. **F005** — After the single-database path is solid
334
+
335
+ ---
336
+
337
+ ## Future Considerations
338
+
339
+ - **Real SQLite execution:** Queries currently go to Ollama for SQL generation but aren't executed against a database. Phase 3 should execute the generated SQL and return actual results.
340
+ - **Multi-episode batching:** For RL training, the environment will need to support multiple concurrent episodes efficiently.
341
+ - **Reward shaping:** The 3-layer reward (operational, progress, terminal) is designed in `models.py` but not implemented.
342
+ - **Table selection without Ollama:** A lightweight keyword/embedding-based table selector could replace the LLM fallback.
343
+
344
+ ---
345
+
346
+ ## Keeping This Map Current
347
+
348
+ Update this file when you change any of:
349
+ - System boundaries (new service, new subsystem)
350
+ - Persistent state locations (new files/dirs written or read)
351
+ - Shared data models or API contracts
352
+ - Cross-cutting invariants
353
+
354
+ ---
355
+
356
+ ## References
357
+
358
+ - Docs index: `docs/README.md`
359
+ - Operations: `docs/RUNBOOK.md`
360
+ - OpenEnv framework: https://github.com/meta-pytorch/OpenEnv
361
+ - Spider dataset: https://huggingface.co/datasets/xlangai/spider
docs/README.md ADDED
@@ -0,0 +1,41 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Docs
2
+
3
+ This directory is the system-of-record for durable project knowledge.
4
+
5
+ ## Quick Links
6
+
7
+ | Category | Index | Type | Purpose |
8
+ |----------|-------|------|---------|
9
+ | **Guides** | [guides/README.md](guides/README.md) | how-to | Practical step-by-step procedures |
10
+ | **Design** | [design-docs/index.md](design-docs/index.md) | explanation | Feature design, ADRs, decision rationale |
11
+ | **ADR Template** | [design-docs/decisions/0001-template.md](design-docs/decisions/0001-template.md) | reference | Decision record template |
12
+ | **References** | [references/README.md](references/README.md) | reference | External docs for agent context |
13
+
14
+ ## System Docs
15
+
16
+ - Architecture: [ARCHITECTURE.md](ARCHITECTURE.md)
17
+ - Operations: [RUNBOOK.md](RUNBOOK.md)
18
+
19
+ ## Directory Structure
20
+
21
+ ```
22
+ docs/
23
+ ├── README.md # This file (index)
24
+ ├── ARCHITECTURE.md # System design overview [reference]
25
+ ├── RUNBOOK.md # Operations guide [how-to]
26
+ ├── guides/ # How-to guides [how-to]
27
+ │ └── README.md # Guide index
28
+ ├── design-docs/ # Decision rationale [explanation]
29
+ │ ├── index.md # Design docs catalogue
30
+ │ └── decisions/ # Architectural Decision Records
31
+ └── references/ # External docs [reference]
32
+ └── README.md # External docs for agent context
33
+ ```
34
+
35
+ ## Adding Documentation
36
+
37
+ | If you need... | Create in... | Type |
38
+ |----------------|--------------|------|
39
+ | Step-by-step procedure | `docs/guides/<topic>.md` | how-to |
40
+ | Design for a feature | `docs/design-docs/<feature>.md` | explanation |
41
+ | External library docs | `docs/references/<library>-llms.txt` | reference |
docs/RUNBOOK.md ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ # Runbook
2
+
3
+ Operational notes: how to run, test, and debug day-to-day.
4
+
5
+ ## Common Commands
6
+
7
+ ```bash
8
+ # Run tests (project package manager)
9
+ uv run pytest tests/ -v
10
+ ```
docs/blog-outline.md ADDED
@@ -0,0 +1,56 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # SQLEnv Blog Post Outline
2
+
3
+ ## 1) Hook: Teaching AI to Think Like a Data Analyst
4
+
5
+ - Open with a concrete moment: an agent sees a new schema and must reason through uncertainty instead of guessing one SQL query.
6
+ - Frame the core idea: SQL competence is not only syntax generation; it is iterative investigation with feedback.
7
+ - Position SQLEnv as a training ground where agents learn exploration habits that mirror analyst workflows.
8
+
9
+ ## 2) The Problem: Static Benchmarks Reward Memorization
10
+
11
+ - Explain why single-shot text-to-SQL can hide brittle behavior when schemas, table names, or data distributions shift.
12
+ - Show that leaderboard accuracy does not guarantee robust reasoning on unfamiliar databases.
13
+ - Describe the gap: most benchmarks grade final answers but ignore how the model arrived there.
14
+ - Tie this directly to user pain: correct-looking SQL can fail in real environments where context changes every session.
15
+
16
+ ## 3) Our Approach: SQLEnv as an Interactive RL Environment
17
+
18
+ - Introduce the action loop: `DESCRIBE`, `SAMPLE`, `QUERY`, and `ANSWER` as the minimum interface for grounded exploration.
19
+ - Explain that each episode starts with a natural-language question and a hidden schema to force discovery.
20
+ - Highlight OpenEnv compatibility so the environment can run with standard training tooling and deployment flows.
21
+
22
+ ## 4) How SQLEnv Works End-to-End
23
+
24
+ - Walk through one episode narrative: inspect table shapes, sample data, run targeted joins, then submit an answer.
25
+ - Summarize reward design in plain language: reward reliable execution, reward progress toward the goal, and strongly reward final correctness.
26
+ - Note guardrails: read-only SQL execution, query timeout, and clear error messages to prevent unsafe or confusing behavior.
27
+
28
+ ## 5) Training with GRPO
29
+
30
+ - Briefly explain GRPO as a practical policy optimization method for improving multi-step tool use behavior.
31
+ - Connect training signals to environment telemetry: each step gives usable feedback rather than waiting for terminal reward only.
32
+ - Clarify expected outcome: strategic behavior should improve over random baselines even with modest compute.
33
+
34
+ ## 6) Results
35
+
36
+ - [PLACEHOLDER: Insert F006 metrics for success rate, average reward, and episode efficiency.]
37
+ - Compare random baseline, trained policy, and oracle policy to show both practical gains and theoretical ceiling.
38
+ - Include one short failure case to show where the policy still struggles and why that insight is useful.
39
+
40
+ ## 7) Technical Highlights
41
+
42
+ - Multi-database Spider coverage with structured metadata and deterministic train/eval split.
43
+ - Typed action and observation models that make environment interactions explicit and debuggable.
44
+ - Deployment-ready packaging for HuggingFace Spaces with bundled databases and health checks.
45
+
46
+ ## 8) Try It Yourself
47
+
48
+ - HuggingFace Space: add live link and a one-line instruction for connecting and running a first episode.
49
+ - Colab notebook: link `notebooks/train_grpo.ipynb` with notes on expected runtime and CPU compatibility.
50
+ - GitHub repository: link setup steps, architecture docs, and verification artifacts for reproducibility.
51
+
52
+ ## 9) What We Learned
53
+
54
+ - Dense intermediate rewards improve learning speed only when they align with the final objective.
55
+ - Tool-using agents benefit from transparent errors; better diagnostics create better policy updates.
56
+ - Packaging and storytelling matter: a reproducible deployment and clear narrative are as important as benchmark numbers for adoption.
docs/design-docs/decisions/0001-template.md ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ADR 0001: <Title>
2
+
3
+ ## Status
4
+
5
+ - Proposed | Accepted | Rejected | Deprecated
6
+
7
+ ## Context
8
+
9
+ Describe the problem and constraints.
10
+
11
+ ## Decision
12
+
13
+ What we decided and why.
14
+
15
+ ## Consequences
16
+
17
+ What gets better, what gets worse, what we need to watch.
18
+
19
+ ## Alternatives Considered
20
+
21
+ List viable alternatives and why they were not chosen.
22
+
23
+ ## Links
24
+
25
+ - Related spec(s):
26
+ - Related PR(s):
docs/design-docs/index.md ADDED
@@ -0,0 +1,57 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Design Docs
2
+
3
+ This directory contains design documentation for architectural decisions — the WHY behind technical choices.
4
+
5
+ ## Core Beliefs
6
+
7
+ See [core-beliefs.md](core-beliefs.md) for agent-first operating principles.
8
+
9
+ ## Decisions (ADRs)
10
+
11
+ Architectural Decision Records are stored in [decisions/](decisions/).
12
+
13
+ | ADR | Title | Status |
14
+ |-----|-------|--------|
15
+ | [0001](decisions/0001-template.md) | ADR Template | Template |
16
+
17
+ ## Feature Design Docs
18
+
19
+ | Feature | Status | Date | Reversibility |
20
+ |---------|--------|------|---------------|
21
+ | *None yet* | | | |
22
+
23
+ ## Creating Design Docs
24
+
25
+ Use the `design-doc` skill for structured decision documentation:
26
+
27
+ ```
28
+ skill({ name: "design-doc" })
29
+ ```
30
+
31
+ The skill guides you through:
32
+ 1. **Context** — What's the situation? What triggered this?
33
+ 2. **Decision Drivers** — Constraints, preferences, quality attributes
34
+ 3. **Options Analysis** — At least 2 options with pros/cons
35
+ 4. **Decision** — Choice + rationale + consequences + reversibility
36
+ 5. **Implementation Guidance** — Key interfaces, boundaries
37
+
38
+ ## When to Create a Design Doc
39
+
40
+ **CREATE when:**
41
+ - Making an architectural choice with multiple valid options
42
+ - Introducing a new pattern or abstraction
43
+ - Choosing between technologies, libraries, or approaches
44
+ - A decision will affect multiple features
45
+
46
+ **SKIP when:**
47
+ - Following an existing established pattern
48
+ - The decision is trivial or easily reversed
49
+ - A simple code comment would suffice
50
+
51
+ ## Integration with Autocode
52
+
53
+ The `autocode-implementation-planner` skill automatically reads linked design docs:
54
+ - Uses constraints as hard requirements
55
+ - Respects the chosen interfaces
56
+ - Stays within the defined boundaries
57
+ - Notes reversibility for future refactoring
docs/guides/README.md ADDED
@@ -0,0 +1,24 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # How-To Guides
2
+
3
+ Practical, goal-oriented guides for getting things done. Each guide addresses a specific task or workflow.
4
+
5
+ **Diataxis type:** How-to (action + application of skill)
6
+
7
+ ## Index
8
+
9
+ | Guide | Goal |
10
+ |-------|------|
11
+ | *None yet* | |
12
+
13
+ ## What Goes Here
14
+
15
+ - Step-by-step instructions for achieving a specific goal
16
+ - Operational procedures (deploy, configure, troubleshoot)
17
+ - Workflow walkthroughs
18
+
19
+ ## What Does NOT Go Here
20
+
21
+ - Learning-oriented content (tutorials)
22
+ - Factual descriptions of APIs/interfaces (go to `docs/references/`)
23
+ - Decision rationale (go to `docs/design-docs/`)
24
+ - Exploratory notes (go to `docs/exploration/`)
docs/learnings/F007-architecture.md ADDED
@@ -0,0 +1 @@
 
 
1
+ - Runtime images for OpenEnv/HF deployments should copy both `.venv` and `data/databases` into `/app/env` so environment logic and SQLite assets ship together for executable episodes and health validation *(F007)*
docs/learnings/F007-conventions.md ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ - Submission-facing notebooks must be Colab-ready by using relative project paths, cleared cell outputs, and a fixed section order (setup -> config -> connect -> train -> eval -> plot) to keep artifacts reproducible and reviewable *(F007)*
2
+ - README top sections should provide a three-command verification path (`uv sync`, `openenv validate`, `pytest`) before deep docs so judges can validate environment viability quickly *(F007)*
docs/learnings/F007-gotchas.md ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ - Hardcoding port 8000 in container startup or health checks can cause false-negative readiness on HuggingFace Spaces where `PORT=7860` is injected at runtime *(F007)*
2
+ - API health checks can report green while episodes still fail unless probes also assert at least one bundled `*.sqlite` file exists under `data/databases` *(F007)*
docs/learnings/F007-integrations.md ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ - HuggingFace Spaces deployment must treat `PORT` as runtime-configurable and wire both `HEALTHCHECK` and `uvicorn` startup to `${PORT:-8000}` for local/HF parity *(F007)*
2
+ - Training notebooks should include an explicit `SQLEnvClient` connect/reset/step smoke test before GRPO runs to fail fast when environment connectivity is broken *(F007)*
docs/learnings/F007-security.md ADDED
@@ -0,0 +1 @@
 
 
1
+ - Run deployment containers as a non-root user (for example uid 10001) after `chown -R /app` to meet least-privilege expectations without breaking runtime file access *(F007)*
docs/learnings/F007-testing.md ADDED
@@ -0,0 +1 @@
 
 
1
+ - Structural notebook rewrites should be guarded by a notebook-focused E2E suite plus full `tests/` regression to catch both training-flow and system-wide integration drift *(F007)*
docs/learnings/F007-workflow.md ADDED
@@ -0,0 +1 @@
 
 
1
+ - Feature finalization should run both targeted E2E checks and full regression, then sync completion metadata in IMPLEMENTATION_SPEC execution status and FEATURES.json progress fields *(F007)*
docs/references/README.md ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ # References
2
+
3
+ External references and pointers that inform decisions.
4
+
5
+ Add links here when they become useful across multiple features.
evaluation/__init__.py ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Public evaluation API for the green agent wrapper."""
2
+
3
+ from .green_agent import EpisodeResult, EvaluationResult, Policy, RandomPolicy, evaluate
4
+
5
+ __all__ = [
6
+ "Policy",
7
+ "RandomPolicy",
8
+ "EpisodeResult",
9
+ "EvaluationResult",
10
+ "evaluate",
11
+ ]
evaluation/green_agent.py ADDED
@@ -0,0 +1,199 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Core types for policy evaluation."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from dataclasses import dataclass
6
+ import random
7
+ import re
8
+ from typing import Callable, Protocol, runtime_checkable
9
+
10
+ try:
11
+ from ..models import SQLAction, SQLObservation
12
+ except ImportError:
13
+ try:
14
+ from models import SQLAction, SQLObservation # type: ignore[no-redef]
15
+ except ImportError:
16
+ from sql_env.models import SQLAction, SQLObservation # type: ignore[no-redef]
17
+
18
+
19
@runtime_checkable
class Policy(Protocol):
    """Interface for policies used by the evaluator.

    Any object exposing a compatible ``select_action`` method satisfies this
    protocol; ``runtime_checkable`` additionally allows ``isinstance`` checks.
    """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Choose one action for the current observation.

        Args:
            observation: The agent-visible environment state for this step.

        Returns:
            The next action to send to the environment.
        """
25
+
26
+
27
@dataclass(frozen=True)
class EpisodeResult:
    """Per-episode metrics from one evaluation run."""

    episode_index: int  # Zero-based position of the episode within the run.
    correct: bool  # True when the episode's final reward was positive.
    total_reward: float  # Sum of per-step rewards over the whole episode.
    steps: int  # Number of env.step() calls taken before termination.
    error: str | None = None  # Exception text if the episode crashed, else None.
36
+
37
+
38
@dataclass(frozen=True)
class EvaluationResult:
    """Aggregate evaluation metrics across all attempted episodes.

    Aggregates (``success_rate``, ``avg_reward``, ``avg_steps``) are computed
    over completed episodes only; crashed episodes are excluded from the
    averages but still counted in ``n_episodes`` and listed in ``episodes``.
    """

    success_rate: float  # Fraction of completed episodes with positive final reward.
    avg_reward: float  # Mean total reward across completed episodes.
    avg_steps: float  # Mean step count across completed episodes.
    n_episodes: int  # Episodes attempted, including crashed ones.
    n_completed: int  # Episodes that finished without raising.
    episodes: list[EpisodeResult]  # Per-episode records, failures included.
48
+
49
+
50
+ class RandomPolicy:
51
+ """Built-in random baseline policy."""
52
+
53
+ _EXPLORATION_ACTIONS = ("DESCRIBE", "SAMPLE", "QUERY")
54
+ _ROW_PATTERN = re.compile(r"^\d+\.\s*(.+)$")
55
+
56
+ def __init__(self, seed: int | None = None) -> None:
57
+ self._rng = random.Random(seed)
58
+
59
+ def select_action(self, observation: SQLObservation) -> SQLAction:
60
+ if observation.budget_remaining <= 1:
61
+ return SQLAction(
62
+ action_type="ANSWER",
63
+ argument=self._random_answer(observation.result),
64
+ )
65
+
66
+ action_type = self._rng.choice(self._EXPLORATION_ACTIONS)
67
+ table_name = self._random_table(observation.schema_info)
68
+ if action_type == "QUERY":
69
+ safe_table_name = table_name.replace('"', '""')
70
+ argument = f'SELECT * FROM "{safe_table_name}" LIMIT 5'
71
+ else:
72
+ argument = table_name
73
+
74
+ return SQLAction(action_type=action_type, argument=argument)
75
+
76
+ def _random_table(self, schema_info: str) -> str:
77
+ table_names = self._extract_table_names(schema_info)
78
+ if not table_names:
79
+ return "unknown"
80
+ return self._rng.choice(table_names)
81
+
82
+ @classmethod
83
+ def _extract_table_names(cls, schema_info: str) -> list[str]:
84
+ table_names: list[str] = []
85
+ for line in schema_info.splitlines():
86
+ stripped = line.strip()
87
+ if not stripped.startswith("- "):
88
+ continue
89
+ candidate = stripped[2:]
90
+ if ":" in candidate:
91
+ candidate = candidate.split(":", maxsplit=1)[0]
92
+ candidate = candidate.strip()
93
+ if candidate:
94
+ table_names.append(candidate)
95
+ return table_names
96
+
97
+ def _random_answer(self, result_text: str) -> str:
98
+ candidates = self._extract_answer_candidates(result_text)
99
+ if not candidates:
100
+ return "unknown"
101
+ return self._rng.choice(candidates)
102
+
103
+ @classmethod
104
+ def _extract_answer_candidates(cls, result_text: str) -> list[str]:
105
+ candidates: list[str] = []
106
+ for line in result_text.splitlines():
107
+ match = cls._ROW_PATTERN.match(line.strip())
108
+ if not match:
109
+ continue
110
+ row_value = match.group(1).strip()
111
+ if not row_value:
112
+ continue
113
+ candidates.append(row_value)
114
+ split_values = [value.strip() for value in row_value.split("|")]
115
+ candidates.extend([value for value in split_values if value])
116
+ return candidates
117
+
118
+
119
def evaluate(
    env: object,
    policy: Policy,
    n_episodes: int = 100,
    *,
    seed: int | None = None,
    progress_callback: Callable[[int, int], None] | None = None,
) -> EvaluationResult:
    """Run ``policy`` against ``env`` for ``n_episodes`` with error isolation.

    Each episode is reset (with ``seed + index`` when a seed is given) and
    stepped until the environment reports done. A raising episode is recorded
    as a failed :class:`EpisodeResult` instead of aborting the run. Aggregate
    metrics are computed over completed episodes only.

    Args:
        env: Environment exposing ``reset(seed=...)`` and ``step(action)``.
        policy: Action selector satisfying the :class:`Policy` protocol.
        n_episodes: Number of episodes to attempt (must be >= 0).
        seed: Optional base seed; episode ``i`` uses ``seed + i``.
        progress_callback: Optional ``(finished, total)`` hook, called after
            every episode, including crashed ones.

    Returns:
        An :class:`EvaluationResult` summarizing the run.

    Raises:
        ValueError: If ``n_episodes`` is negative.
    """
    if n_episodes < 0:
        raise ValueError("n_episodes must be >= 0")

    results: list[EpisodeResult] = []
    for index in range(n_episodes):
        try:
            reset_seed = None if seed is None else seed + index
            obs = env.reset(seed=reset_seed)
            reward_sum = 0.0
            step_total = 0
            while not obs.done:
                obs = env.step(policy.select_action(obs))
                reward_sum += obs.reward or 0.0
                step_total += 1
            # Correctness is judged by the FINAL step's reward being positive.
            results.append(
                EpisodeResult(
                    episode_index=index,
                    correct=(obs.reward or 0.0) > 0.0,
                    total_reward=reward_sum,
                    steps=step_total,
                )
            )
        except Exception as exc:
            # Isolate per-episode failures so one bad episode cannot sink
            # the whole evaluation run.
            results.append(
                EpisodeResult(
                    episode_index=index,
                    correct=False,
                    total_reward=0.0,
                    steps=0,
                    error=str(exc),
                )
            )

        if progress_callback is not None:
            progress_callback(index + 1, n_episodes)

    # Aggregates are computed over completed (non-crashed) episodes only.
    # This also covers n_episodes == 0, which yields all-zero aggregates.
    completed = [ep for ep in results if ep.error is None]
    if not completed:
        return EvaluationResult(
            success_rate=0.0,
            avg_reward=0.0,
            avg_steps=0.0,
            n_episodes=n_episodes,
            n_completed=0,
            episodes=results,
        )

    count = len(completed)
    wins = sum(1 for ep in completed if ep.correct)
    return EvaluationResult(
        success_rate=wins / count,
        avg_reward=sum(ep.total_reward for ep in completed) / count,
        avg_steps=sum(ep.steps for ep in completed) / count,
        n_episodes=n_episodes,
        n_completed=count,
        episodes=results,
    )
+ )
models.py ADDED
@@ -0,0 +1,272 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ SQLEnv Pydantic models — the data contracts between client and server.
3
+
4
+ These models define the typed interface for the SQLEnv RL environment,
5
+ following the OpenEnv pattern (see OpenEnv Tutorial for reference):
6
+
7
+ Action — what the agent sends each step
8
+ Observation — what the agent receives back
9
+ State — episode metadata (exposed via the state endpoint)
10
+
11
+ RL terminology — state vs observation
12
+ ─────────────────────────────────────
13
+ In RL theory:
14
+
15
+ State (s) A COMPLETE description of the world. Nothing is hidden.
16
+ Observation (o) A PARTIAL description of a state, which may omit info.
17
+
18
+ In SQLEnv these map to:
19
+
20
+ EpisodeContext The full RL state (s). Lives on the server only.
21
+ Contains gold answers, reward accumulators, DB
22
+ connection, full query history — everything needed
23
+ to advance the simulation and compute rewards.
24
+
25
+ SQLObservation The observation (o). Sent to the agent over the wire.
26
+ Contains the question, truncated results, revealed
27
+ schema, budget, and action history. The agent NEVER
28
+ sees the gold answer, progress scores, or full DB.
29
+
30
+ SQLState OpenEnv's "State" base class — lightweight episode
31
+ metadata (episode_id, step_count). This is NOT the
32
+ RL state; it is a convenience for logging/debugging.
33
+
34
+ This separation is what makes SQLEnv a POMDP: the agent must act under
35
+ uncertainty, which is what makes exploration necessary and learnable.
36
+ """
37
+
38
+ import sqlite3
39
+ from dataclasses import dataclass, field as dataclass_field
40
+
41
+ from openenv.core.env_server.interfaces import Message
42
+ from openenv.core.env_server.types import Action, Observation, State
43
+ from pydantic import Field
44
+ import torch
45
+
46
+ # ---------------------------------------------------------------------------
47
+ # Wire types: these cross the HTTP boundary between client and server
48
+ # ---------------------------------------------------------------------------
49
+
50
+
51
class SQLAction(Action):
    """What the agent sends each step.

    The action space is intentionally small and structured so agents can
    explicitly control the environment loop.
    """

    # NOTE(review): action_type is a free-form str rather than a
    # Literal/enum; invalid values are presumably rejected server-side —
    # confirm against the environment's step handler.
    action_type: str = Field(
        ...,
        description="One of: DESCRIBE, SAMPLE, QUERY, ANSWER",
    )
    # Meaning depends on action_type: table name, SQL string, or answer value.
    argument: str = Field(
        ...,
        description=(
            "Table name (DESCRIBE/SAMPLE), SQL string (QUERY), "
            "or answer value (ANSWER)."
        ),
    )
69
+
70
+
71
class SQLObservation(Observation):
    """What the agent receives after each step.

    This is the agent's PARTIAL view of the world. Key design choices:

    - schema_info starts with table names only; columns are revealed
      incrementally as the agent DESCRIBEs tables.
    - result is always a truncated string, never raw data. The agent sees
      what a human analyst would see in a terminal — at most N rows of
      formatted text. This keeps the observation bounded and forces the
      agent to reason about what it sees rather than brute-force scanning.
    - action_history gives the agent memory of its own trajectory without
      the server needing to re-send full results from prior steps.
    """

    # Inherited from Observation: done (bool), reward (float | None)
    question: str = Field(..., description="The NL question to answer")
    schema_info: str = Field(..., description="Known schema information")
    # result/error default to "" so the first observation after reset is valid.
    result: str = Field(default="", description="Result of the last action")
    error: str = Field(default="", description="Error message if action failed")
    step_count: int = Field(default=0, description="Current step number")
    budget_remaining: int = Field(default=0, description="Steps remaining")
    action_history: list[str] = Field(
        default_factory=list,
        description="Summary of previous actions",
    )
97
+
98
+
99
class SQLState(State):
    """Episode metadata exposed via GET /state.

    This is the minimal public state — enough for logging and debugging,
    but NOT the full internal bookkeeping (see EpisodeContext below).
    """

    # # Inherited from State: episode_id (str | None), step_count (int)
    # game_name: str = Field(
    #     "sql_env", description="Name of the game/environment"
    # )
    # Per-step chat history; index-aligned with history_tokens.
    history_messages: list[Message] = Field(default_factory=list)
    # NOTE(review): torch.Tensor is not a native pydantic type; this field
    # presumably relies on arbitrary-type support in the State base model
    # and will not JSON-serialize as-is — confirm.
    history_tokens: list[torch.Tensor] = Field(
        default_factory=list
    )  # Same len as messages
    current_action_type: str = Field(
        default="QUERY",
        description="Current action type: DESCRIBE, SAMPLE, QUERY, or ANSWER",
    )
118
+
119
+
120
@dataclass
class QuestionRecord:
    """One question from the Spider dataset.

    The gold fields (gold_sql, gold_answer) are server-side reference data
    and are never revealed to the agent.
    """

    question_id: str  # Stable identifier, e.g. "spider_dev_042".
    question_text: str  # The natural-language question posed to the agent.
    database_name: str  # Which SQLite database this question runs against.
    gold_sql: str  # Reference SQL (hidden from the agent).
    gold_answer: str  # Expected answer (hidden from the agent).
    answer_type: str  # e.g. integer, float, string, list, table.
    difficulty: str  # e.g. easy, medium, hard.
    tables_involved: list[str]  # Tables the gold query touches.
132
+
133
+
134
@dataclass
class EpisodeContext:
    """Per-episode server-side state (never sent to agent).

    Holds everything the server needs to execute actions, compute rewards,
    and build the next observation. See the design outline comments below
    for the intended semantics of each field.
    """

    episode_id: str  # Unique identifier for this episode.
    db_connection: sqlite3.Connection  # Connection to the episode's SQLite DB.
    question_record: QuestionRecord  # Selected question plus gold metadata.
    step_count: int = 0  # Steps taken so far (0 at reset).
    budget: int = 15  # Steps remaining; episode ends when exhausted.
    described_tables: set[str] = dataclass_field(default_factory=set)  # Tables already DESCRIBEd.
    action_log: list[str] = dataclass_field(default_factory=list)  # Human-readable action summaries.
    done: bool = False  # Whether the episode has terminated.
    gold_answer: str | None = None  # Cached gold answer, if precomputed.
    gold_rows: list[tuple] = dataclass_field(default_factory=list)  # Gold SQL result rows for progress scoring.
    query_hashes: set[str] = dataclass_field(default_factory=set)  # Hashes of executed queries, for repeat detection.
    best_progress: float = 0.0  # Best binned progress score achieved so far.
    cumulative_step_reward: float = 0.0  # Running sum of all per-step rewards.
    cumulative_new_info_reward: float = 0.0  # Running r_new_info total (capped per episode).
152
+
153
+
154
+ # ---------------------------------------------------------------------------
155
+ # Conceptual internal state: what the server tracks per episode
156
+ # ---------------------------------------------------------------------------
157
+ #
158
+ # The classes below are a DESIGN OUTLINE, not runnable implementation.
159
+ # They describe the information the server needs to maintain during an
160
+ # episode so that it can:
161
+ #
162
+ # 1. Execute actions against the database
163
+ # 2. Compute the 3-layer reward signal
164
+ # 3. Enforce budget limits and anti-gaming measures
165
+ # 4. Build the next observation for the agent
166
+ #
167
+ # These are SERVER-ONLY — they never cross the HTTP boundary.
168
+ # Implementation will follow in server/environment.py during Phase 2.
169
+ #
170
+ #
171
+ # EpisodeContext — Per-episode server state
172
+ # ──────────────────────────────────────────
173
+ # Conceptual fields:
174
+ #
175
+ # episode_id: str
176
+ # Unique identifier for this episode (UUID).
177
+ #
178
+ # question_record: QuestionRecord
179
+ # The selected question and its metadata:
180
+ # - question_id, question_text, database_name
181
+ # - gold_sql, gold_answer, answer_type, difficulty
182
+ # Loaded from the question set JSON at reset().
183
+ #
184
+ # db_connection: sqlite3.Connection
185
+ # Read-only connection to the episode's SQLite database.
186
+ # Opened at reset(), closed when the episode ends.
187
+ # Enforces: read-only mode, statement timeout (5s), SELECT-only.
188
+ #
189
+ # step_count: int
190
+ # Current step number (0 at reset, incremented each step()).
191
+ #
192
+ # budget: int
193
+ # Steps remaining. Starts at max_steps (default 15).
194
+ # Decremented on each non-ANSWER action. Episode terminates
195
+ # when budget hits 0 without an ANSWER.
196
+ #
197
+ # --- Schema tracking (for observation building) ---
198
+ #
199
+ # known_tables: set[str]
200
+ # Table names revealed to the agent. Starts with ALL table names
201
+ # (agent sees table names at reset), but column details are hidden.
202
+ #
203
+ # described_tables: dict[str, list[ColumnInfo]]
204
+ # Tables the agent has DESCRIBEd → their column info.
205
+ # Used to build the incrementally-revealed schema_info string.
206
+ #
207
+ # --- Reward tracking (Layer 1: Operational) ---
208
+ #
209
+ # query_hashes: set[str]
210
+ # Hashes of all SQL queries executed this episode.
211
+ # Used for repeat detection (r_repeat penalty).
212
+ #
213
+ # explored_entities: set[str]
214
+ # Set of "table.column" strings the agent has discovered.
215
+ # Used for r_new_info reward. Capped at 0.10 total per episode.
216
+ #
217
+ # cumulative_new_info_reward: float
218
+ # Running total of r_new_info awarded. Once this reaches the cap
219
+ # (0.10), no more r_new_info is given.
220
+ #
221
+ # --- Reward tracking (Layer 2: Progress) ---
222
+ #
223
+ # gold_result: Any
224
+ # The result of running gold_sql on the database, computed once
225
+ # at reset(). This is the reference for progress comparison.
226
+ #
227
+ # best_progress: float
228
+ # Best binned progress score achieved so far (one of
229
+ # {0, 0.25, 0.5, 0.75, 1.0}). Reward is given only when
230
+ # a QUERY result IMPROVES over this value.
231
+ #
232
+ # --- Reward tracking (aggregates) ---
233
+ #
234
+ # cumulative_step_reward: float
235
+ # Running sum of all per-step rewards (Layers 1 + 2).
236
+ # Clamped to [-0.2, +0.5] at episode end.
237
+ #
238
+ # --- Action history (for observation) ---
239
+ #
240
+ # action_log: list[str]
241
+ # Human-readable summaries of each action taken, e.g.:
242
+ # "DESCRIBE employees → 5 columns"
243
+ # "QUERY: SELECT COUNT(*) FROM orders → 42"
244
+ # "ANSWER: 42 → correct"
245
+ # Sent to the agent in SQLObservation.action_history so it has
246
+ # memory of its own trajectory.
247
+ #
248
+ #
249
+ # QuestionRecord — Metadata for a single question
250
+ # ─────────────────────────────────────────────────
251
+ # Conceptual fields:
252
+ #
253
+ # question_id: str e.g. "spider_dev_042"
254
+ # question_text: str The natural language question
255
+ # database_name: str Which SQLite database to load
256
+ # gold_sql: str Reference SQL (hidden from agent)
257
+ # gold_answer: str Expected answer (hidden from agent)
258
+ # answer_type: str One of: integer, float, string, list, table
259
+ # difficulty: str One of: easy, medium, hard
260
+ # tables_involved: list[str] Which tables the gold query touches
261
+ #
262
+ #
263
+ # ColumnInfo — Schema detail for a single column
264
+ # ───────────────────────────────────────────────
265
+ # Conceptual fields:
266
+ #
267
+ # name: str Column name
268
+ # dtype: str SQLite type (TEXT, INTEGER, REAL, etc.)
269
+ # is_primary_key: bool Whether this is a PK
270
+ # is_foreign_key: bool Whether this is a FK
271
+ # references: str | None "table.column" if FK, else None
272
+ #
notebooks/train_grpo.ipynb ADDED
@@ -0,0 +1,226 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Training a SQL Agent with GRPO + SQLEnv\n",
8
+ "\n",
9
+ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/)\n",
10
+ "\n",
11
+ "This notebook is a Colab-ready walkthrough for training an agent against SQLEnv. It follows setup, configuration, connectivity check, training, evaluation, and plotting in one linear flow."
12
+ ]
13
+ },
14
+ {
15
+ "cell_type": "markdown",
16
+ "metadata": {},
17
+ "source": [
18
+ "## 1) Setup\n",
19
+ "Install dependencies and (optionally) clone the repository when running in a fresh Colab runtime."
20
+ ]
21
+ },
22
+ {
23
+ "cell_type": "code",
24
+ "execution_count": null,
25
+ "metadata": {},
26
+ "outputs": [],
27
+ "source": [
28
+ "%pip install -q \"trl>=0.9.0\" \"transformers>=4.46.0\" \"datasets>=3.0.0\" \"matplotlib>=3.8.0\" \"openenv>=0.1.9\" \"websockets>=15.0.1\"\n",
29
+ "\n",
30
+ "# Optional in Colab if project files are not already present:\n",
31
+ "# !git clone https://github.com/<your-org>/<your-repo>.git\n",
32
+ "# %cd <your-repo>"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "markdown",
37
+ "metadata": {},
38
+ "source": [
39
+ "## 2) Configuration\n",
40
+ "Set environment URL, model, and core training hyperparameters."
41
+ ]
42
+ },
43
+ {
44
+ "cell_type": "code",
45
+ "execution_count": null,
46
+ "metadata": {},
47
+ "outputs": [],
48
+ "source": [
49
+ "from __future__ import annotations\n",
50
+ "\n",
51
+ "import matplotlib.pyplot as plt\n",
52
+ "\n",
53
+ "from sql_env.client import SQLEnvClient\n",
54
+ "from sql_env.training.config import GRPOConfig\n",
55
+ "from sql_env.training.data_loading import load_model_and_tokenizer, load_question_prompts\n",
56
+ "from sql_env.training.notebook_pipeline import build_trainer, run_training_with_metrics, sample_random_baseline\n",
57
+ "from sql_env.training.rewards import reward_correctness, reward_operational, reward_progress\n",
58
+ "\n",
59
+ "try:\n",
60
+ " from trl import GRPOConfig as TRLGRPOConfig\n",
61
+ " from trl import GRPOTrainer\n",
62
+ "except Exception as exc:\n",
63
+ " raise RuntimeError(\n",
64
+ " \"TRL is required for this notebook. Install dependencies in the Setup cell first.\"\n",
65
+ " ) from exc\n",
66
+ "\n",
67
+ "SPACE_URL = \"ws://localhost:8000/ws\"\n",
68
+ "MODEL_NAME = \"Qwen/Qwen3-0.6B\"\n",
69
+ "\n",
70
+ "# TODO: update after F006 if artifact paths or defaults change.\n",
71
+ "config = GRPOConfig(\n",
72
+ " questions_path=\"data/questions/questions_train.json\",\n",
73
+ " db_dir=\"data/databases\",\n",
74
+ " output_dir=\"outputs/grpo_run\",\n",
75
+ " model_name=MODEL_NAME,\n",
76
+ " num_train_epochs=1,\n",
77
+ " per_device_train_batch_size=1,\n",
78
+ " gradient_accumulation_steps=1,\n",
79
+ " num_generations=2,\n",
80
+ " step_budget=10,\n",
81
+ ")"
82
+ ]
83
+ },
84
+ {
85
+ "cell_type": "markdown",
86
+ "metadata": {},
87
+ "source": [
88
+ "## 3) Connect and Smoke Test\n",
89
+ "Confirm the environment is reachable and can execute a short episode."
90
+ ]
91
+ },
92
+ {
93
+ "cell_type": "code",
94
+ "execution_count": null,
95
+ "metadata": {},
96
+ "outputs": [],
97
+ "source": [
98
+ "client = SQLEnvClient(base_url=SPACE_URL)\n",
99
+ "client.connect()\n",
100
+ "obs = client.reset(seed=42)\n",
101
+ "print(\"Question:\", obs.question)\n",
102
+ "\n",
103
+ "_ = client.step(\"DESCRIBE student\")\n",
104
+ "_ = client.step(\"SAMPLE student\")\n",
105
+ "\n",
106
+ "client.close()"
107
+ ]
108
+ },
109
+ {
110
+ "cell_type": "markdown",
111
+ "metadata": {},
112
+ "source": [
113
+ "## 4) Train with GRPO\n",
114
+ "Build a trainer and run a short training pass."
115
+ ]
116
+ },
117
+ {
118
+ "cell_type": "code",
119
+ "execution_count": null,
120
+ "metadata": {},
121
+ "outputs": [],
122
+ "source": [
123
+ "model, tokenizer = load_model_and_tokenizer(config.model_name)\n",
124
+ "prompts = load_question_prompts(config.questions_path, config.difficulty_filter)\n",
125
+ "\n",
126
+ "before_rollouts = sample_random_baseline([item[\"prompt\"] for item in prompts[:8]], step_budget=config.step_budget, seed=config.seed)\n",
127
+ "\n",
128
+ "reward_funcs = [reward_correctness, reward_progress, reward_operational]\n",
129
+ "trainer = build_trainer(\n",
130
+ " trl_grpo_config_cls=TRLGRPOConfig,\n",
131
+ " grpo_trainer_cls=GRPOTrainer,\n",
132
+ " model=model,\n",
133
+ " tokenizer=tokenizer,\n",
134
+ " prompts=prompts,\n",
135
+ " config=config,\n",
136
+ " reward_funcs=reward_funcs,\n",
137
+ ")\n",
138
+ "\n",
139
+ "# TODO: update after F006 if training entry points are renamed.\n",
140
+ "train_output, steps, rewards = run_training_with_metrics(trainer)\n",
141
+ "print(train_output)"
142
+ ]
143
+ },
144
+ {
145
+ "cell_type": "markdown",
146
+ "metadata": {},
147
+ "source": [
148
+ "## 5) Evaluate\n",
149
+ "Run a quick held-out evaluation summary after training."
150
+ ]
151
+ },
152
+ {
153
+ "cell_type": "code",
154
+ "execution_count": null,
155
+ "metadata": {},
156
+ "outputs": [],
157
+ "source": [
158
+ "held_out_prompts = [item[\"prompt\"] for item in load_question_prompts(\"data/questions/questions_eval.json\", None)[:16]]\n",
159
+ "after_rollouts = sample_random_baseline(held_out_prompts, step_budget=config.step_budget, seed=config.seed + 1)\n",
160
+ "\n",
161
+ "baseline_avg_steps = sum(len(item[\"completion\"].splitlines()) for item in before_rollouts) / max(1, len(before_rollouts))\n",
162
+ "eval_avg_steps = sum(len(item[\"completion\"].splitlines()) for item in after_rollouts) / max(1, len(after_rollouts))\n",
163
+ "\n",
164
+ "print({\n",
165
+ " \"baseline_avg_steps\": round(baseline_avg_steps, 2),\n",
166
+ " \"held_out_avg_steps\": round(eval_avg_steps, 2),\n",
167
+ " \"eval_count\": len(after_rollouts),\n",
168
+ "})"
169
+ ]
170
+ },
171
+ {
172
+ "cell_type": "markdown",
173
+ "metadata": {},
174
+ "source": [
175
+ "## 6) Plot Results\n",
176
+ "Visualize the reward trend collected during training."
177
+ ]
178
+ },
179
+ {
180
+ "cell_type": "code",
181
+ "execution_count": null,
182
+ "metadata": {},
183
+ "outputs": [],
184
+ "source": [
185
+ "if steps and rewards:\n",
186
+ " plt.figure(figsize=(8, 4))\n",
187
+ " plt.plot(steps, rewards, marker=\"o\", linewidth=1.5)\n",
188
+ " plt.title(\"GRPO Reward Trend\")\n",
189
+ " plt.xlabel(\"Training Step\")\n",
190
+ " plt.ylabel(\"Reward\")\n",
191
+ " plt.grid(alpha=0.3)\n",
192
+ " plt.show()\n",
193
+ "else:\n",
194
+ " print(\"No reward points available yet.\")"
195
+ ]
196
+ },
197
+ {
198
+ "cell_type": "markdown",
199
+ "metadata": {},
200
+ "source": [
201
+ "## Next Steps\n",
202
+ "- Full training workflow: `specs/F006-IMPLEMENTATION_SPEC.md`\n",
203
+ "- Deployment package: `specs/F007-IMPLEMENTATION_SPEC.md`\n",
204
+ "- Live environment endpoint: replace `SPACE_URL` with your HF Space WebSocket URL\n",
205
+ "- Blog narrative source: `docs/blog-outline.md`"
206
+ ]
207
+ }
208
+ ],
209
+ "metadata": {
210
+ "colab": {
211
+ "name": "train_grpo.ipynb",
212
+ "provenance": []
213
+ },
214
+ "kernelspec": {
215
+ "display_name": "Python 3",
216
+ "language": "python",
217
+ "name": "python3"
218
+ },
219
+ "language_info": {
220
+ "name": "python",
221
+ "version": "3.12"
222
+ }
223
+ },
224
+ "nbformat": 4,
225
+ "nbformat_minor": 5
226
+ }
opencode.jsonc ADDED
@@ -0,0 +1,283 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "$schema": "https://opencode.ai/config.json",
3
+ // ============================================================================
4
+ // FULLSTACK AUTOCODE TEMPLATE
5
+ // ============================================================================
6
+ // For: FastAPI + Next.js projects with autonomous autocode workflow
7
+ // Copy to project root: cp ~/.config/opencode/templates/fullstack-autocode.jsonc ./opencode.jsonc
8
+ //
9
+ // This template is PERMISSIVE because verification comes from:
10
+ // - VERIFICATION_SPEC.md (independent test criteria)
11
+ // - review-modern subagent (auto-fix + bounded iteration)
12
+ // - git history (atomic commits per step)
13
+ //
14
+ // NOT from permission prompts.
15
+ //
16
+ // For headless/CLI automation (ralph-loop.sh, opencode run), all tools that
17
+ // might prompt must be pre-approved. See docs/opencode-server-mode.md for
18
+ // details on server mode alternatives.
19
+ // ============================================================================
20
+
21
+ "permission": {
22
+ // Allow reading from global OpenCode assets (skills, commands, agents, scripts)
23
+ // Also allow specs/** and vision/** to prevent sandbox false-positives
24
+ // in parallel-feature clones where OpenCode may misidentify project root
25
+ "external_directory": {
26
+ "~/.config/opencode/skills/**": "allow",
27
+ "~/.config/opencode/commands/**": "allow",
28
+ "~/.config/opencode/agents/**": "allow",
29
+ "~/.config/opencode/scripts/**": "allow",
30
+ "specs/**": "allow",
31
+ "vision/**": "allow"
32
+ },
33
+
34
+ "read": "allow",
35
+ "glob": "allow",
36
+ "grep": "allow", // Needed for codebase exploration
37
+ "list": "allow", // Directory listing tool
38
+ "edit": "allow", // Trust git as safety net
39
+
40
+ // Allow subagent invocation for autonomous workflows
41
+ // CRITICAL: Without this, /autocode-next-step will hang in CLI mode
42
+ "task": "allow",
43
+
44
+ // Allow skill loading (for complex multi-skill workflows)
45
+ "skill": "allow",
46
+
47
+ // Allow web fetching for documentation lookups (optional, set to "ask" if concerned)
48
+ "webfetch": "allow",
49
+
50
+ "bash": {
51
+ // Catch-all: ask for anything not explicitly allowed below
52
+ // This ensures unknown commands still prompt rather than fail silently
53
+ "*": "ask",
54
+
55
+ // ========================================================================
56
+ // TASK RUNNERS
57
+ // ========================================================================
58
+ "task": "allow",
59
+ "task *": "allow",
60
+ "make": "allow",
61
+ "make *": "allow",
62
+
63
+ // ========================================================================
64
+ // PYTHON / UV
65
+ // ========================================================================
66
+ "uv": "allow",
67
+ "uv *": "allow",
68
+ "uv sync": "allow",
69
+ "uv venv": "allow",
70
+ "uv run *": "allow",
71
+ "uv pip *": "allow",
72
+ "uv add *": "allow",
73
+ "uv remove *": "allow",
74
+ "uv lock *": "allow",
75
+
76
+ // Direct test/lint invocation (used by /techdebt and verification)
77
+ "uv run pytest": "allow",
78
+ "uv run pytest *": "allow",
79
+ "uv run ruff *": "allow",
80
+ "uv run mypy *": "allow",
81
+ "uv run black *": "allow",
82
+
83
+ // Direct invocation without uv (for projects not using uv)
84
+ "pytest": "allow",
85
+ "pytest *": "allow",
86
+ "ruff": "allow",
87
+ "ruff *": "allow",
88
+ "ruff check *": "allow",
89
+ "mypy": "allow",
90
+ "mypy *": "allow",
91
+ "black *": "allow",
92
+ "isort *": "allow",
93
+
94
+ // ========================================================================
95
+ // NODE / NPM / BUN
96
+ // ========================================================================
97
+ "npm install": "allow",
98
+ "npm ci": "allow",
99
+ "npm run dev": "allow",
100
+ "npm run build": "allow",
101
+ "npm run lint": "allow",
102
+ "npm run test": "allow",
103
+ "npm run test *": "allow",
104
+ "npm run start": "allow",
105
+ "npm run format": "allow",
106
+ "npm run typecheck": "allow",
107
+ "npm run typecheck *": "allow",
108
+
109
+ // ESLint direct invocation (used by /techdebt)
110
+ "npx eslint": "allow",
111
+ "npx eslint *": "allow",
112
+ "npm outdated": "allow",
113
+ "npm ls *": "allow",
114
+ "npm audit": "allow",
115
+ "npm audit *": "allow",
116
+
117
+ "bun install": "allow",
118
+ "bun run *": "allow",
119
+ "bun test": "allow",
120
+ "bun test *": "allow",
121
+ "bun add *": "allow",
122
+ "bun remove *": "allow",
123
+
124
+ // ========================================================================
125
+ // GIT - Full workflow (autonomous commits/push)
126
+ // ========================================================================
127
+ "git add *": "allow",
128
+ "git commit *": "allow",
129
+ "git push": "allow",
130
+ "git push *": "allow",
131
+ "git checkout *": "allow",
132
+ "git switch *": "allow",
133
+ "git branch": "allow",
134
+ "git branch *": "allow",
135
+ "git stash *": "allow",
136
+ "git pull": "allow",
137
+ "git pull *": "allow",
138
+ "git fetch *": "allow",
139
+ "git merge *": "allow",
140
+ "git rebase *": "allow",
141
+ "git tag *": "allow",
142
+ "git cherry-pick *": "allow",
143
+
144
+ // Git diagnostics (used by /commit-push-pr and /autocode-next-step)
145
+ "git status": "allow",
146
+ "git status *": "allow",
147
+ "git diff": "allow",
148
+ "git diff *": "allow",
149
+ "git log *": "allow",
150
+ "git rev-parse *": "allow",
151
+ "git rev-list *": "allow",
152
+ "git remote *": "allow",
153
+ "git show *": "allow",
154
+ "git ls-remote *": "allow",
155
+
156
+ // EXPLICIT DENY: Force push (destructive, stays as ask)
157
+ "git push --force": "ask",
158
+ "git push --force *": "ask",
159
+ "git push -f": "ask",
160
+ "git push -f *": "ask",
161
+
162
+ // ========================================================================
163
+ // GITHUB CLI - PR workflow (no merge)
164
+ // ========================================================================
165
+ "gh auth status": "allow",
166
+ "gh pr create *": "allow",
167
+ "gh pr view *": "allow",
168
+ "gh pr list *": "allow",
169
+ "gh pr checkout *": "allow",
170
+ "gh pr diff *": "allow",
171
+ "gh pr status": "allow",
172
+ "gh pr ready *": "allow",
173
+ "gh pr comment *": "allow",
174
+ "gh issue *": "allow",
175
+ "gh repo view *": "allow",
176
+ "gh repo clone *": "allow",
177
+
178
+ // EXPLICIT DENY: Merge and dangerous API calls (stay as ask)
179
+ // These inherit "ask" from global "*": "ask", but listed for clarity
180
+ // "gh pr merge *": "ask"
181
+ // "gh api *": "ask"
182
+
183
+ // ========================================================================
184
+ // DOCKER (common safe commands)
185
+ // ========================================================================
186
+ "docker build *": "allow",
187
+ "docker run *": "allow",
188
+ "docker ps": "allow",
189
+ "docker ps *": "allow",
190
+ "docker images": "allow",
191
+ "docker images *": "allow",
192
+ "docker logs *": "allow",
193
+ "docker exec *": "allow",
194
+ "docker stop *": "allow",
195
+ "docker start *": "allow",
196
+ "docker restart *": "allow",
197
+ "docker rm *": "allow",
198
+ "docker rmi *": "allow",
199
+ "docker compose *": "allow",
200
+ "docker-compose *": "allow",
201
+
202
+ // ========================================================================
203
+ // PYTHON (JSON validation, scripting)
204
+ // ========================================================================
205
+ "python3": "allow",
206
+ "python3 *": "allow",
207
+ "python": "allow",
208
+ "python *": "allow",
209
+
210
+ // ========================================================================
211
+ // FILE OPERATIONS (safe, commonly needed during development)
212
+ // ========================================================================
213
+ "mv *": "allow",
214
+ "mkdir *": "allow",
215
+ "mkdir -p *": "allow",
216
+ "cp *": "allow",
217
+ "cp -r *": "allow",
218
+ "rm *": "allow",
219
+ "rm -r *": "allow",
220
+ "rm -rf *": "allow",
221
+ "touch *": "allow",
222
+
223
+ // ========================================================================
224
+ // FILE/DIR CHECKS (used by scripts and agents)
225
+ // ========================================================================
226
+ "test *": "allow",
227
+ "test -f *": "allow",
228
+ "test -d *": "allow",
229
+ "test -e *": "allow",
230
+ "[ *": "allow",
231
+
232
+ // ========================================================================
233
+ // DIAGNOSTICS (inherited from global, but explicit for clarity)
234
+ // ========================================================================
235
+ "ls": "allow",
236
+ "ls *": "allow",
237
+ "cat *": "allow",
238
+ "head *": "allow",
239
+ "tail *": "allow",
240
+ "which *": "allow",
241
+ "pwd": "allow",
242
+ "echo *": "allow",
243
+ "tr *": "allow",
244
+ "wc *": "allow",
245
+ "true": "allow",
246
+ "false": "allow",
247
+ "grep *": "allow",
248
+ "find *": "allow",
249
+ "tree *": "allow",
250
+ "stat *": "allow",
251
+ "file *": "allow",
252
+ "basename *": "allow",
253
+ "dirname *": "allow",
254
+ "realpath *": "allow",
255
+
256
+ // ========================================================================
257
+ // RUST / CARGO (if applicable)
258
+ // ========================================================================
259
+ "cargo": "allow",
260
+ "cargo *": "allow",
261
+ "cargo build": "allow",
262
+ "cargo build *": "allow",
263
+ "cargo test": "allow",
264
+ "cargo test *": "allow",
265
+ "cargo clippy": "allow",
266
+ "cargo clippy *": "allow",
267
+ "cargo fmt": "allow",
268
+ "cargo fmt *": "allow",
269
+ "cargo check": "allow",
270
+ "cargo check *": "allow",
271
+ "cargo run": "allow",
272
+ "cargo run *": "allow",
273
+
274
+ // ========================================================================
275
+ // UTILITIES (timestamps for specs)
276
+ // ========================================================================
277
+ "date": "allow",
278
+ "date *": "allow"
279
+ }
280
+ },
281
+
282
+ "instructions": ["AGENTS.md"]
283
+ }
openenv.yaml ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ spec_version: 1
2
+ name: sql_env
3
+ type: space
4
+ runtime: fastapi
5
+ app: server.app:app
6
+ port: 8000
progress.log ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [2026-03-28T18:00:24+0100] === Ralph Loop Start ===
2
+ [2026-03-28T18:00:24+0100] Spec: specs/F007-IMPLEMENTATION_SPEC.md
3
+ [2026-03-28T18:00:24+0100] Model: openai/gpt-5.3-codex
4
+ [2026-03-28T18:00:24+0100] Max iterations: 20
5
+ [2026-03-28T18:04:33+0100] Iteration 1/20 | Step: 1.1 | action=continue
6
+ [2026-03-28T18:08:10+0100] Iteration 2/20 | Step: 1.3 | action=continue
7
+ [2026-03-28T18:10:48+0100] Iteration 3/20 | Step: 1.3 | action=continue
8
+ [2026-03-28T18:14:57+0100] Iteration 4/20 | Step: 2.1 | action=continue
9
+ [2026-03-28T18:17:25+0100] Iteration 5/20 | Step: 2.2 | action=continue
10
+ [2026-03-28T18:17:25+0100] === Ralph Loop Aborted === reason=Finalization stuck after 5 iterations
11
+ [2026-03-28T21:04:43+0100] === Ralph Loop Start ===
12
+ [2026-03-28T21:04:43+0100] Spec: specs/F007-IMPLEMENTATION_SPEC.md
13
+ [2026-03-28T21:04:43+0100] Model: openai/gpt-5.3-codex
14
+ [2026-03-28T21:04:43+0100] Max iterations: 20
15
+ [2026-03-28T21:09:06+0100] Iteration 1/20 | Step: 3.1 | action=continue
16
+ [2026-03-28T21:40:17+0100] Iteration 2/20 | Step: unknown | action=blocked | reason=External deployment verification is blocked by GHCR access/auth failure (403 pulling base image), so verifier gate cannot approve final completion yet.
17
+ [2026-03-28T21:44:42+0100] Iteration 3/20 | Step: unknown | action=blocked | reason=External credential/access dependency remains: need authenticated GHCR pull and HF push evidence (build+push attempt) to satisfy final verifier approval.
18
+ [2026-03-28T22:05:11+0100] Iteration 4/20 | Step: unknown | action=blocked | reason=Awaiting user-side authenticated deployment evidence: successful GHCR-authenticated `uv run openenv build -t openenv-sql-env-f007-hf-submission` and `uv run openenv push` output before verifier/final completion can proceed.
19
+ [2026-03-28T22:49:48+0100] Iteration 5/20 | Step: unknown | action=blocked | reason=Awaiting user-provided authenticated external deployment evidence (GHCR-authenticated `openenv build` success and `openenv push` output) to satisfy final verifier gate for F007.
20
+ [2026-03-28T22:50:20+0100] === Ralph Loop Start ===
21
+ [2026-03-28T22:50:20+0100] Spec: specs/F007-IMPLEMENTATION_SPEC.md
22
+ [2026-03-28T22:50:20+0100] Model: openai/gpt-5.3-codex
23
+ [2026-03-28T22:50:20+0100] Max iterations: 20
24
+ [2026-03-28T22:54:21+0100] Iteration 1/20 | Step: unknown | action=blocked | reason=Missing external authenticated deployment evidence (GHCR-authenticated build and Hugging Face push output) required by F007 final verification gate.
25
+ [2026-03-28T23:00:44+0100] Iteration 2/20 | Step: unknown | action=blocked | reason=Authenticated deployment attempts now run, but `openenv build` fails with local Docker disk exhaustion (`No space left on device`) and `openenv push` fails with HF namespace permission (`403 Forbidden` for `hjerpe/sql_env`) plus README frontmatter metadata validation (`colorFrom`/`colorTo`), so final verification gate cannot pass without external intervention.
26
+ [2026-03-28T23:14:35+0100] === Ralph Loop Start ===
27
+ [2026-03-28T23:14:35+0100] Spec: specs/F007-IMPLEMENTATION_SPEC.md
28
+ [2026-03-28T23:14:35+0100] Model: openai/gpt-5.3-codex
29
+ [2026-03-28T23:14:35+0100] Max iterations: 20
pyproject.toml ADDED
@@ -0,0 +1,69 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [build-system]
2
+ requires = ["setuptools>=45", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "sql-env"
7
+ version = "0.1.0"
8
+ description = "Interactive SQL exploration RL environment for the OpenEnv Challenge"
9
+ requires-python = ">=3.11,<3.13"
10
+ dependencies = [
11
+ # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
12
+ "openenv-core[core]>=0.2.1",
13
+ # Environment-specific dependencies
14
+ "pydantic>=2.0.0",
15
+ "fastapi>=0.104.0",
16
+ "uvicorn>=0.24.0",
17
+ "torch==2.2.2",
18
+ "transformers<5",
19
+ "numpy<2",
20
+ "requests>=2.31.0",
21
+ "sqlalchemy>=2.0.47",
22
+ "jupyter>=1.1.1",
23
+ "notebook>=7.5.5",
24
+ ]
25
+
26
+ [project.optional-dependencies]
27
+ dev = [
28
+ "pytest>=8.0.0",
29
+ "pytest-cov>=4.0.0",
30
+ "ruff>=0.4.0",
31
+ ]
32
+ training = [
33
+ "trl>=0.14.0,<0.15.0",
34
+ "accelerate>=0.34.0",
35
+ "matplotlib>=3.7.0",
36
+ ]
37
+
38
+ [project.scripts]
39
+ # Server entry point — enables: uv run server
40
+ server = "sql_env.server.app:main"
41
+
42
+ [tool.setuptools]
43
+ include-package-data = true
44
+ packages = [
45
+ "sql_env",
46
+ "sql_env.server",
47
+ "sql_env.data",
48
+ "sql_env.data.databases",
49
+ ]
50
+ package-dir = { "sql_env" = ".", "sql_env.server" = "server", "sql_env.data" = "data", "sql_env.data.databases" = "data/databases" }
51
+
52
+ [tool.ruff]
53
+ line-length = 88
54
+ exclude = ["scripts/"]
55
+
56
+ [tool.ruff.lint]
57
+ select = ["E", "F", "W"]
58
+
59
+ [tool.ruff.lint.per-file-ignores]
60
+ # SQL schema strings and LLM prompts are intentionally long
61
+ "server/sql_environment.py" = ["E501"]
62
+
63
+ [tool.pytest.ini_options]
64
+ testpaths = ["tests"]
65
+ pythonpath = ["."]
66
+ addopts = "--import-mode=importlib"
67
+ markers = [
68
+ "slow: integration or long-running tests",
69
+ ]
scripts/curate_questions.py ADDED
@@ -0,0 +1,921 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Curate multi-database Spider questions for SQLEnv."""
2
+
3
+ from __future__ import annotations
4
+
5
import argparse
import io
import json
import logging
import re
import sqlite3
import time
import zipfile
from collections.abc import Iterable
from contextlib import closing
from pathlib import Path
from typing import Any, Callable
from urllib.parse import quote

import requests
19
+
20
+
21
+ SPIDER_SQLITE_URLS = (
22
+ "https://raw.githubusercontent.com/taoyds/spider/master/database/{db_id}/{db_id}.sqlite",
23
+ "https://github.com/taoyds/spider/raw/master/database/{db_id}/{db_id}.sqlite",
24
+ )
25
+ SPIDER_DATASET_FILE_ID = "1403EGqzIDoHMdQF4c9Bkyl7dZLZ5Wt6J"
26
+ SPIDER_DATASET_DOWNLOAD_URL = "https://drive.usercontent.google.com/download"
27
+
28
+ SQLITE_MAGIC_HEADER = b"SQLite format 3\x00"
29
+ DB_ID_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")
30
+ TABLE_TOKEN_PATTERN = re.compile(
31
+ r"\b(?:FROM|JOIN)\s+([`\"\[]?[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)?[`\"\]]?)",
32
+ flags=re.IGNORECASE,
33
+ )
34
+ CTE_ALIAS_PATTERN = re.compile(
35
+ r"(?:\bWITH\b|,)\s*([A-Za-z_][A-Za-z0-9_]*)\s+AS\s*\(",
36
+ flags=re.IGNORECASE,
37
+ )
38
+
39
+ TRAIN_SPLIT = "train"
40
+ EVAL_SPLIT = "eval"
41
+ VALID_SPLITS = {TRAIN_SPLIT, EVAL_SPLIT}
42
+ VALID_ANSWER_TYPES = {"integer", "float", "string", "list", "table"}
43
+ VALID_DIFFICULTIES = {"easy", "medium", "hard"}
44
+ REQUIRED_FIELDS = (
45
+ "question_id",
46
+ "question_text",
47
+ "database_name",
48
+ "gold_sql",
49
+ "gold_answer",
50
+ "answer_type",
51
+ "difficulty",
52
+ "tables_involved",
53
+ "split",
54
+ )
55
+
56
+ LOGGER = logging.getLogger(__name__)
57
+ _SPIDER_ARCHIVE_BYTES: bytes | None = None
58
+
59
+
60
+ def _normalize_table_name(raw_table: str) -> str:
61
+ """Normalize a table token extracted from SQL text."""
62
+ token = raw_table.strip().strip('`"[]')
63
+ if "." in token:
64
+ token = token.split(".", maxsplit=1)[1]
65
+ return token
66
+
67
+
68
def _validate_db_id(db_id: str) -> None:
    """Reject identifiers that would be unsafe as filesystem path components."""
    if DB_ID_PATTERN.fullmatch(db_id) is None:
        raise ValueError(f"Invalid db_id '{db_id}'. Expected [A-Za-z0-9_]+")
72
+
73
+
74
def _is_valid_sqlite_file(path: Path) -> bool:
    """Check that the file exists and starts with the SQLite magic header."""
    header_size = len(SQLITE_MAGIC_HEADER)
    if not path.exists() or path.stat().st_size < header_size:
        return False
    with path.open("rb") as stream:
        return stream.read(header_size) == SQLITE_MAGIC_HEADER
80
+
81
+
82
def _download_sqlite_file(db_id: str, destination: Path) -> None:
    """Download one Spider SQLite file into destination.

    Tries each raw-GitHub URL template (with one retry apiece), then falls
    back to downloading the full Spider dataset zip and extracting the one
    database from it.

    Args:
        db_id: Spider database identifier.
        destination: Path to write ``{db_id}.sqlite``.

    Raises:
        FileNotFoundError: If all sources fail for this ``db_id``.
    """
    _validate_db_id(db_id)
    destination.parent.mkdir(parents=True, exist_ok=True)

    last_error: str | None = None
    for url_template in SPIDER_SQLITE_URLS:
        url = url_template.format(db_id=db_id)
        for attempt in range(2):  # one retry per URL
            try:
                response = requests.get(url, timeout=30)
                response.raise_for_status()
                # Stage through a temp file so a partial or invalid download
                # never leaves a corrupt file at the final destination.
                tmp_path = destination.with_suffix(".sqlite.tmp")
                tmp_path.write_bytes(response.content)
                if not _is_valid_sqlite_file(tmp_path):
                    tmp_path.unlink(missing_ok=True)
                    raise FileNotFoundError(
                        f"Downloaded payload for '{db_id}' was not a valid SQLite file"
                    )
                tmp_path.replace(destination)
                return
            except (requests.RequestException, OSError, FileNotFoundError) as exc:
                last_error = str(exc)
                if attempt == 0:
                    time.sleep(5)

    # Fallback: pull the whole Spider archive and extract just this database.
    try:
        archive_bytes = _download_spider_archive()
        _extract_sqlite_from_archive(
            archive_bytes=archive_bytes,
            db_id=db_id,
            destination=destination,
        )
        return
    except (
        requests.RequestException,
        OSError,
        FileNotFoundError,
        zipfile.BadZipFile,
    ) as exc:
        last_error = str(exc)

    raise FileNotFoundError(
        f"Unable to download Spider SQLite for '{db_id}'. Last error: {last_error}"
    )
135
+
136
+
137
def _download_spider_archive() -> bytes:
    """Download and cache official Spider dataset archive bytes.

    The archive lives on Google Drive, which serves an HTML "virus scan"
    interstitial for large files; when that page is returned, the hidden
    form fields (``confirm``/``uuid``) are scraped and a second request is
    issued against the direct-download endpoint.

    Returns:
        Raw bytes of the Spider zip archive (cached at module level).

    Raises:
        FileNotFoundError: If both attempts fail or the payload is not a zip.
    """
    # Module-level cache: the archive is large, download it at most once
    # per process.
    global _SPIDER_ARCHIVE_BYTES
    if _SPIDER_ARCHIVE_BYTES is not None:
        return _SPIDER_ARCHIVE_BYTES

    last_error: str | None = None
    for attempt in range(2):  # one retry for transient network failures
        try:
            session = requests.Session()
            warning_page = session.get(
                f"https://drive.google.com/uc?export=download&id={SPIDER_DATASET_FILE_ID}",
                timeout=60,
            )
            warning_page.raise_for_status()

            payload = warning_page.content
            content_type = warning_page.headers.get("content-type", "")
            # An HTML response means we hit the interstitial, not the file.
            if "text/html" in content_type.lower():
                page_text = warning_page.text
                params: dict[str, str] = {
                    "id": SPIDER_DATASET_FILE_ID,
                    "export": "download",
                }
                # Carry over the hidden confirmation tokens when present.
                for field in ("confirm", "uuid"):
                    match = re.search(
                        rf'name="{field}" value="([^"]+)"',
                        page_text,
                    )
                    if match:
                        params[field] = match.group(1)

                download_response = session.get(
                    SPIDER_DATASET_DOWNLOAD_URL,
                    params=params,
                    timeout=240,
                )
                download_response.raise_for_status()
                payload = download_response.content

            # Zip files start with the "PK" magic bytes; anything else is
            # an error page or truncated download.
            if not payload.startswith(b"PK"):
                raise FileNotFoundError(
                    "Spider dataset download did not return a zip file"
                )

            _SPIDER_ARCHIVE_BYTES = payload
            return _SPIDER_ARCHIVE_BYTES
        except (requests.RequestException, FileNotFoundError) as exc:
            last_error = str(exc)
            if attempt == 0:
                time.sleep(5)

    raise FileNotFoundError(
        f"Unable to download Spider dataset zip. Last error: {last_error}"
    )
192
+
193
+
194
def _extract_sqlite_from_archive(
    archive_bytes: bytes, db_id: str, destination: Path
) -> None:
    """Extract one SQLite file from the Spider zip archive."""
    # Archive root directory has varied across Spider releases.
    member_candidates = (
        f"spider_data/database/{db_id}/{db_id}.sqlite",
        f"spider/database/{db_id}/{db_id}.sqlite",
        f"spider-master/database/{db_id}/{db_id}.sqlite",
    )

    payload: bytes | None = None
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        available = set(archive.namelist())
        for member in member_candidates:
            if member in available:
                payload = archive.read(member)
                break

    if payload is None:
        raise FileNotFoundError(f"Database '{db_id}' not found in Spider archive")

    # Stage through a temp file so an invalid payload never clobbers an
    # existing good database at the destination.
    tmp_path = destination.with_suffix(".sqlite.tmp")
    tmp_path.write_bytes(payload)
    if not _is_valid_sqlite_file(tmp_path):
        tmp_path.unlink(missing_ok=True)
        raise FileNotFoundError(
            f"Archive payload for '{db_id}' was not a valid SQLite file"
        )
    tmp_path.replace(destination)
224
+
225
+
226
def download_spider_databases(db_ids: list[str], output_dir: Path) -> dict[str, Path]:
    """Download Spider SQLite database files for selected ``db_ids``.

    Existing valid files are reused and not downloaded again. Databases
    that fail to download are logged and skipped rather than aborting the
    whole batch.

    Args:
        db_ids: Spider database IDs.
        output_dir: Base output directory (e.g. ``data/databases``).

    Returns:
        Mapping of ``db_id`` to local SQLite path (only successful ones).

    Raises:
        ValueError: If a resolved path escapes ``output_dir``.
        FileNotFoundError: If no requested database can be prepared.
    """
    db_paths: dict[str, Path] = {}
    output_root = output_dir.resolve()

    for db_id in db_ids:
        _validate_db_id(db_id)
        sqlite_path = output_dir / db_id / f"{db_id}.sqlite"
        # Defense-in-depth: even with the db_id pattern check, verify the
        # resolved target stays under the output root (no symlink/.. escape).
        resolved_path = sqlite_path.resolve()
        if output_root not in resolved_path.parents:
            raise ValueError(
                "Resolved path "
                f"'{resolved_path}' escapes output directory '{output_root}'"
            )

        # Reuse an already-downloaded, valid database file.
        if _is_valid_sqlite_file(sqlite_path):
            db_paths[db_id] = sqlite_path
            continue

        try:
            _download_sqlite_file(db_id=db_id, destination=sqlite_path)
        except FileNotFoundError as exc:
            # Best-effort batch: skip this database but keep the others.
            LOGGER.warning("Skipping database '%s': %s", db_id, exc)
            continue
        db_paths[db_id] = sqlite_path

    if not db_paths:
        raise FileNotFoundError("No Spider SQLite databases could be prepared")

    return db_paths
269
+
270
+
271
def _load_questions_from_hf_datasets(db_ids: set[str]) -> list[dict[str, Any]]:
    """Load questions through the `datasets` package when available."""
    try:
        from datasets import load_dataset
    except ImportError as exc:
        # Normalized to ConnectionError so the caller can treat every
        # loader failure uniformly.
        raise ConnectionError("`datasets` package is not installed") from exc

    records: list[dict[str, Any]] = []
    for spider_split in ("train", "validation"):
        dataset = load_dataset("xlangai/spider", split=spider_split)
        for row in dataset:
            database_id = row.get("db_id")
            if database_id in db_ids:
                records.append(
                    {
                        "db_id": database_id,
                        "query": row.get("query", ""),
                        "question": row.get("question", ""),
                        "spider_split": spider_split,
                    }
                )
    return records
293
+
294
+
295
def _load_questions_from_spider_archive(db_ids: set[str]) -> list[dict[str, Any]]:
    """Load Spider questions from the official dataset zip archive.

    Args:
        db_ids: Database IDs to keep; all other questions are filtered out.

    Returns:
        Question records tagged with their original ``spider_split``.

    Raises:
        ConnectionError: If no questions matched the requested db_ids.
    """
    archive_bytes = _download_spider_archive()
    records: list[dict[str, Any]] = []

    # (archive member, split label) pairs; dev.json maps to "validation"
    # to match the HuggingFace split naming used by the other loaders.
    split_files = (
        ("spider_data/train_spider.json", "train"),
        ("spider_data/dev.json", "validation"),
    )

    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for member_name, spider_split in split_files:
            try:
                payload = archive.read(member_name)
            except KeyError:
                # Member layout differs between releases; skip missing files.
                continue

            rows = json.loads(payload.decode("utf-8"))
            if not isinstance(rows, list):
                continue

            for row in rows:
                if not isinstance(row, dict):
                    continue
                db_id = row.get("db_id")
                if db_id not in db_ids:
                    continue
                records.append(
                    {
                        "db_id": db_id,
                        "query": row.get("query", ""),
                        "question": row.get("question", ""),
                        "spider_split": spider_split,
                    }
                )

    # An empty result means the archive layout or filter is wrong; raise so
    # the caller can fall through to the next loading strategy.
    if not records:
        raise ConnectionError(
            "No Spider questions found in archive for selected db_ids"
        )

    return records
337
+
338
+
339
def _load_questions_from_hf_rows_api(db_ids: set[str]) -> list[dict[str, Any]]:
    """Load Spider questions from the HuggingFace datasets rows API."""
    endpoint = "https://datasets-server.huggingface.co/rows"
    page_size = 100
    records: list[dict[str, Any]] = []

    for spider_split in ("train", "validation"):
        offset = 0
        # Page through the split until the server returns an empty page.
        while True:
            response = requests.get(
                endpoint,
                params={
                    "dataset": "xlangai/spider",
                    "config": "spider",
                    "split": spider_split,
                    "offset": offset,
                    "length": page_size,
                },
                timeout=30,
            )
            response.raise_for_status()
            rows = response.json().get("rows", [])
            if not rows:
                break

            for wrapper in rows:
                row = wrapper.get("row", {})
                database_id = row.get("db_id")
                if database_id in db_ids:
                    records.append(
                        {
                            "db_id": database_id,
                            "query": row.get("query", ""),
                            "question": row.get("question", ""),
                            "spider_split": spider_split,
                        }
                    )
            offset += len(rows)

    return records
378
+
379
+
380
def load_spider_questions(db_ids: list[str]) -> list[dict[str, Any]]:
    """Load raw Spider questions for selected databases.

    Tries three strategies in order — the official zip archive, the
    `datasets` package, then the HuggingFace rows API — each with one
    retry, returning the first successful result.

    Args:
        db_ids: Spider database IDs.

    Returns:
        Filtered list of question records including ``spider_split`` metadata.

    Raises:
        ConnectionError: If all loading strategies fail.
        ValueError: If any db_id fails validation.
    """
    if not db_ids:
        return []

    db_set = set(db_ids)
    for db_id in db_set:
        _validate_db_id(db_id)

    # Ordered by preference: archive is most complete, rows API is the
    # dependency-free last resort.
    loaders: tuple[Callable[[set[str]], list[dict[str, Any]]], ...] = (
        _load_questions_from_spider_archive,
        _load_questions_from_hf_datasets,
        _load_questions_from_hf_rows_api,
    )

    last_error: str | None = None
    for loader in loaders:
        for attempt in range(2):  # one retry per strategy
            try:
                return loader(db_set)
            except (ConnectionError, OSError, requests.RequestException) as exc:
                last_error = f"{loader.__name__}: {exc}"
                if attempt == 0:
                    time.sleep(5)

    raise ConnectionError(
        f"Unable to load Spider questions from HuggingFace. Last error: {last_error}"
    )
418
+
419
+
420
+ def _shape_rows(rows: list[tuple[Any, ...]]) -> Any:
421
+ """Shape SQL rows into scalar/list/table forms used by the dataset."""
422
+ if not rows:
423
+ return []
424
+
425
+ column_count = len(rows[0])
426
+ if column_count == 1:
427
+ values = [row[0] for row in rows]
428
+ if len(values) == 1:
429
+ return values[0]
430
+ return values
431
+
432
+ return [list(row) for row in rows]
433
+
434
+
435
def compute_gold_answer(gold_sql: str, db_path: Path) -> Any:
    """Execute gold SQL against SQLite and return a normalized result.

    Args:
        gold_sql: SQL query to run (read-only).
        db_path: Path to an on-disk SQLite database.

    Returns:
        Result shaped by ``_shape_rows`` (scalar, list, or table).

    Raises:
        FileNotFoundError: If ``db_path`` does not exist.
        sqlite3.Error: If the file is not a valid SQLite database or the
            query fails.
    """
    if not db_path.exists():
        raise FileNotFoundError(f"Database not found: {db_path}")
    if not _is_valid_sqlite_file(db_path):
        raise sqlite3.Error(f"Invalid SQLite database file: {db_path}")

    # URI with mode=ro opens the database read-only so gold queries can
    # never mutate the fixture data.
    db_uri = f"file:{quote(str(db_path.resolve()))}?mode=ro"
    # BUG FIX: sqlite3.Connection as a context manager only manages the
    # transaction — it does NOT close the connection, leaking a file handle
    # per call. closing() guarantees the connection is released, even when
    # the query raises.
    with closing(sqlite3.connect(db_uri, uri=True)) as conn:
        cursor = conn.execute(gold_sql)
        rows = cursor.fetchall()
    return _shape_rows(rows)
447
+
448
+
449
def classify_answer_type(gold_answer: Any) -> str:
    """Classify the answer type for a computed gold answer."""
    # bool is a subclass of int and deliberately lands in the integer bucket.
    if isinstance(gold_answer, (bool, int)):
        return "integer"
    if isinstance(gold_answer, float):
        return "float"
    if isinstance(gold_answer, str):
        return "string"

    if isinstance(gold_answer, tuple):
        # A 1-tuple is treated as its single element; any other width is a row.
        if len(gold_answer) == 1:
            return classify_answer_type(gold_answer[0])
        return "table"

    if isinstance(gold_answer, list):
        # Nested sequences mean rows ("table"); otherwise a flat value list.
        if gold_answer and isinstance(gold_answer[0], (list, tuple)):
            return "table"
        return "list"

    if gold_answer is None:
        # NULL-only results are exposed downstream as (empty) lists.
        return "list"

    raise ValueError(f"Unsupported gold_answer type: {type(gold_answer).__name__}")
477
+
478
+
479
def extract_tables_involved(gold_sql: str) -> list[str]:
    """Extract table names referenced after FROM/JOIN tokens.

    CTE aliases (matched by ``CTE_ALIAS_PATTERN``) are excluded so only
    real tables are reported.  Names are normalized and returned sorted.
    """
    if not gold_sql.strip():
        return []

    cte_aliases: set[str] = set()
    for cte_match in CTE_ALIAS_PATTERN.finditer(gold_sql):
        cte_aliases.add(cte_match.group(1).lower())

    found: set[str] = set()
    for token_match in TABLE_TOKEN_PATTERN.finditer(gold_sql):
        name = _normalize_table_name(token_match.group(1))
        if name and name.lower() not in cte_aliases:
            found.add(name)
    return sorted(found)
494
+
495
+
496
def classify_difficulty(tables_involved: Iterable[str]) -> str:
    """Assign difficulty from the number of distinct tables involved."""
    # Empty names are ignored; duplicates collapse via the set.
    distinct_tables = {name for name in tables_involved if name}
    count = len(distinct_tables)
    if count <= 2:
        return "easy"
    return "medium" if count == 3 else "hard"
504
+
505
+
506
+ def _load_db_list(db_list_path: Path) -> list[str]:
507
+ """Load database IDs from a JSON array file."""
508
+ payload = json.loads(db_list_path.read_text(encoding="utf-8"))
509
+ if not isinstance(payload, list) or not all(
510
+ isinstance(item, str) for item in payload
511
+ ):
512
+ raise ValueError(f"Expected JSON list[str] in {db_list_path}")
513
+ return payload
514
+
515
+
516
def assign_splits(questions: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Assign SQLEnv train/eval splits from Spider split metadata.

    Records whose ``spider_split`` is "validation" (or already the eval
    split name) map to the eval split, "train" maps to train, and unknown
    values default to train with a warning.  The eval split is then
    rebalanced toward roughly 30% of the total by moving records between
    splits in list order.

    Args:
        questions: Question records carrying ``spider_split`` metadata.

    Returns:
        Shallow copies of the input records with a ``split`` field added.
    """
    split_questions: list[dict[str, Any]] = []
    for question in questions:
        spider_split = str(question.get("spider_split", "")).lower()
        if spider_split in {"validation", EVAL_SPLIT}:
            split = EVAL_SPLIT
        elif spider_split in {"train", TRAIN_SPLIT}:
            split = TRAIN_SPLIT
        else:
            LOGGER.warning(
                "Unknown spider_split '%s' for database '%s'; defaulting to train",
                spider_split,
                question.get("database_name", "unknown"),
            )
            split = TRAIN_SPLIT
        # Copy so the caller's records are never mutated.
        updated = dict(question)
        updated["split"] = split
        split_questions.append(updated)

    # With zero or one record there is nothing to rebalance.
    total = len(split_questions)
    if total <= 1:
        return split_questions

    train_records = [q for q in split_questions if q["split"] == TRAIN_SPLIT]
    eval_records = [q for q in split_questions if q["split"] == EVAL_SPLIT]
    # Rebalancing only makes sense when both splits are non-empty.
    if not train_records or not eval_records:
        return split_questions

    # Target roughly a 70/30 train/eval ratio, with at least one eval record.
    target_eval_count = max(1, round(total * 0.3))
    current_eval_count = len(eval_records)

    if current_eval_count >= target_eval_count:
        if current_eval_count == target_eval_count:
            return split_questions

        # Too many eval records: demote the leading excess to train.  These
        # dicts are the same objects held by split_questions, so mutating
        # them updates the returned list in place.
        excess = min(current_eval_count - target_eval_count, len(eval_records))
        for index in range(excess):
            eval_records[index]["split"] = TRAIN_SPLIT
        return split_questions

    # Too few eval records: promote leading train records to eval.
    needed = min(target_eval_count - current_eval_count, len(train_records))
    for index in range(needed):
        train_records[index]["split"] = EVAL_SPLIT

    return split_questions
562
+
563
+
564
+ def _sort_enriched_questions(
565
+ questions: list[dict[str, Any]],
566
+ ) -> list[dict[str, Any]]:
567
+ """Return deterministically ordered records for stable output files."""
568
+ return sorted(
569
+ questions,
570
+ key=lambda item: (
571
+ str(item.get("database_name", "")),
572
+ str(item.get("spider_split", "")),
573
+ str(item.get("gold_sql", "")),
574
+ str(item.get("question_text", "")),
575
+ ),
576
+ )
577
+
578
+
579
+ def _assign_question_ids(questions: list[dict[str, Any]]) -> list[dict[str, Any]]:
580
+ """Assign IDs with format ``{db_id}_{split}_{index:03d}`` per db/split."""
581
+ counters: dict[tuple[str, str], int] = {}
582
+ with_ids: list[dict[str, Any]] = []
583
+
584
+ for question in questions:
585
+ db_id = str(question["database_name"])
586
+ split = str(question["split"])
587
+ key = (db_id, split)
588
+ index = counters.get(key, 0)
589
+ counters[key] = index + 1
590
+
591
+ updated = dict(question)
592
+ updated["question_id"] = f"{db_id}_{split}_{index:03d}"
593
+ with_ids.append(updated)
594
+
595
+ return with_ids
596
+
597
+
598
+ def _write_output(path: Path, records: list[dict[str, Any]]) -> None:
599
+ """Write JSON records to disk."""
600
+ path.parent.mkdir(parents=True, exist_ok=True)
601
+ path.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")
602
+
603
+
604
+ def _load_output_questions(path: Path) -> list[dict[str, Any]]:
605
+ """Load curated output records from a JSON file."""
606
+ try:
607
+ payload = json.loads(path.read_text(encoding="utf-8"))
608
+ except FileNotFoundError as exc:
609
+ raise ValueError(f"Output dataset file not found: {path}") from exc
610
+ except json.JSONDecodeError as exc:
611
+ raise ValueError(f"Output dataset file is invalid JSON: {path}") from exc
612
+
613
+ if not isinstance(payload, list):
614
+ raise ValueError(f"Expected JSON list in {path}")
615
+ records: list[dict[str, Any]] = []
616
+ for index, item in enumerate(payload):
617
+ if not isinstance(item, dict):
618
+ raise ValueError(f"Expected record object at index {index} in {path}")
619
+ records.append(item)
620
+ return records
621
+
622
+
623
+ def _question_fingerprint(record: dict[str, Any]) -> tuple[str, str, str]:
624
+ """Build a stable identity tuple for split leakage checks."""
625
+ return (
626
+ str(record.get("database_name", "")),
627
+ str(record.get("question_text", "")),
628
+ str(record.get("gold_sql", "")),
629
+ )
630
+
631
+
632
def validate_dataset(
    questions: list[dict[str, Any]],
    db_paths: dict[str, Path],
) -> list[str]:
    """Validate curated records and return all detected issues.

    Checks required fields, non-empty values, valid answer types /
    difficulties / splits, unique question IDs, train/eval leakage, and —
    when a database path is available — that the stored ``gold_answer``
    matches a re-execution of ``gold_sql``.  Difficulty distribution is
    only warned about (via LOGGER), never reported as an error.

    Args:
        questions: Curated question records to validate.
        db_paths: Mapping of database name to its SQLite file path.

    Returns:
        A list of human-readable error strings; empty when valid.
    """
    errors: list[str] = []
    question_ids: set[str] = set()
    train_fingerprints: set[tuple[str, str, str]] = set()
    eval_fingerprints: set[tuple[str, str, str]] = set()
    difficulty_counts: dict[str, int] = {key: 0 for key in VALID_DIFFICULTIES}

    for index, question in enumerate(questions):
        context = f"record[{index}]"
        missing = [field for field in REQUIRED_FIELDS if field not in question]
        if missing:
            # Without all required fields the per-field checks below would
            # KeyError, so skip the rest of this record.
            errors.append(f"{context}: missing required fields: {', '.join(missing)}")
            continue

        question_id = str(question["question_id"]).strip()
        if not question_id:
            errors.append(f"{context}: question_id must be non-empty")
        elif question_id in question_ids:
            errors.append(f"{context}: duplicate question_id '{question_id}'")
        else:
            question_ids.add(question_id)

        question_text = str(question["question_text"]).strip()
        if not question_text:
            errors.append(f"{context}: question_text must be non-empty")

        db_id = str(question["database_name"]).strip()
        if not db_id:
            # Remaining checks need a database name; skip this record.
            errors.append(f"{context}: database_name must be non-empty")
            continue

        gold_sql = str(question["gold_sql"]).strip()
        if not gold_sql:
            errors.append(f"{context}: gold_sql must be non-empty")

        answer_type = str(question["answer_type"]).strip()
        if answer_type not in VALID_ANSWER_TYPES:
            errors.append(
                f"{context}: answer_type '{answer_type}' is invalid "
                f"(expected one of {sorted(VALID_ANSWER_TYPES)})"
            )

        difficulty = str(question["difficulty"]).strip()
        if difficulty not in VALID_DIFFICULTIES:
            errors.append(
                f"{context}: difficulty '{difficulty}' is invalid "
                f"(expected one of {sorted(VALID_DIFFICULTIES)})"
            )
        else:
            # Counted for the distribution warnings at the end.
            difficulty_counts[difficulty] += 1

        tables = question["tables_involved"]
        if not isinstance(tables, list) or not tables:
            errors.append(f"{context}: tables_involved must be a non-empty list")
        elif not all(
            isinstance(table_name, str) and table_name.strip() for table_name in tables
        ):
            errors.append(
                f"{context}: tables_involved must contain non-empty table name strings"
            )

        split = str(question["split"]).strip()
        if split not in VALID_SPLITS:
            errors.append(
                f"{context}: split '{split}' is invalid "
                f"(expected one of {sorted(VALID_SPLITS)})"
            )
        else:
            # Fingerprints collected per split power the leakage check below.
            fingerprint = _question_fingerprint(question)
            if split == TRAIN_SPLIT:
                train_fingerprints.add(fingerprint)
            else:
                eval_fingerprints.add(fingerprint)

        if gold_sql and db_id in db_paths:
            try:
                # Re-run the gold SQL to confirm the stored answer is current.
                recomputed = compute_gold_answer(
                    gold_sql=gold_sql, db_path=db_paths[db_id]
                )
                if recomputed != question["gold_answer"]:
                    errors.append(
                        f"{context}: gold_answer mismatch"
                        f" for question_id '{question_id}'"
                    )
            except (sqlite3.Error, FileNotFoundError) as exc:
                errors.append(
                    f"{context}: gold_sql execution failed"
                    f" for database '{db_id}': {exc}"
                )
        elif db_id not in db_paths:
            errors.append(
                f"{context}: missing database path"
                f" for '{db_id}' (expected in data/databases)"
            )

    # Any identical (db, text, sql) triple in both splits is leakage.
    leaked = sorted(train_fingerprints.intersection(eval_fingerprints))
    if leaked:
        errors.append(
            f"train/eval split leak detected:"
            f" {len(leaked)} question(s) appear in both splits"
        )

    # Difficulty targets (40/40/20) are soft — log warnings, not errors.
    total = len(questions)
    if total > 0:
        easy_ratio = difficulty_counts["easy"] / total
        medium_ratio = difficulty_counts["medium"] / total
        hard_ratio = difficulty_counts["hard"] / total
        if abs(easy_ratio - 0.40) > 0.20:
            LOGGER.warning(
                "Difficulty distribution off target: easy=%s (target 40%%)",
                f"{easy_ratio:.2%}",
            )
        if abs(medium_ratio - 0.40) > 0.20:
            LOGGER.warning(
                "Difficulty distribution off target: medium=%s (target 40%%)",
                f"{medium_ratio:.2%}",
            )
        if abs(hard_ratio - 0.20) > 0.15:
            LOGGER.warning(
                "Difficulty distribution off target: hard=%s (target 20%%)",
                f"{hard_ratio:.2%}",
            )

    return errors
760
+
761
+
762
def main() -> None:
    """CLI entry point for the dataset curation pipeline.

    In default mode: downloads the listed databases, loads the matching
    Spider questions, enriches each with gold answer / type / difficulty /
    tables, assigns splits and IDs, validates, and writes train/eval JSON
    files.  With ``--validate``, only re-checks existing output files.
    """
    logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

    parser = argparse.ArgumentParser(
        description="Curate Spider questions into enriched train/eval JSON files."
    )
    parser.add_argument(
        "--db-list",
        type=Path,
        default=Path("data/questions/db_list.json"),
        help="Path to JSON list of Spider database IDs.",
    )
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path("data/databases"),
        help="Directory where SQLite files will be stored.",
    )
    parser.add_argument(
        "--validate",
        action="store_true",
        help="Validate existing output files instead of running full curation.",
    )
    parser.add_argument(
        "--train-output",
        type=Path,
        default=Path("data/questions/questions_train.json"),
        help="Output path for curated train questions.",
    )
    parser.add_argument(
        "--eval-output",
        type=Path,
        default=Path("data/questions/questions_eval.json"),
        help="Output path for curated eval questions.",
    )

    args = parser.parse_args()

    # --- Validation-only mode: check existing outputs and exit. ---
    if args.validate:
        try:
            train_questions = _load_output_questions(args.train_output)
            eval_questions = _load_output_questions(args.eval_output)
        except ValueError as exc:
            print(f"ERROR: {exc}")
            raise SystemExit(1) from exc

        questions = train_questions + eval_questions

        db_ids = sorted(
            {str(record.get("database_name", "")).strip() for record in questions}
        )
        try:
            for db_id in db_ids:
                _validate_db_id(db_id)
        except ValueError as exc:
            print(f"ERROR: {exc}")
            raise SystemExit(1) from exc

        # Expected on-disk layout: <output-dir>/<db_id>/<db_id>.sqlite
        db_paths = {
            db_id: args.output_dir / db_id / f"{db_id}.sqlite"
            for db_id in db_ids
            if db_id
        }
        errors = validate_dataset(questions=questions, db_paths=db_paths)
        if errors:
            for error in errors:
                print(f"ERROR: {error}")
            raise SystemExit(1)

        print(f"Validation passed for {len(questions)} curated records")
        raise SystemExit(0)

    # --- Full curation: fetch databases and questions. ---
    db_ids = _load_db_list(args.db_list)
    db_paths = download_spider_databases(db_ids=db_ids, output_dir=args.output_dir)
    raw_questions = load_spider_questions(db_ids)

    # Enrich each raw Spider question; skip anything incomplete or failing.
    enriched_questions: list[dict[str, Any]] = []
    skipped_count = 0
    for raw_question in raw_questions:
        db_id = str(raw_question.get("db_id", "")).strip()
        if db_id not in db_paths:
            skipped_count += 1
            continue

        gold_sql = str(raw_question.get("query", "")).strip()
        question_text = str(raw_question.get("question", "")).strip()
        if not gold_sql or not question_text:
            skipped_count += 1
            continue

        try:
            gold_answer = compute_gold_answer(
                gold_sql=gold_sql,
                db_path=db_paths[db_id],
            )
        except sqlite3.Error as exc:
            LOGGER.warning(
                "Skipping question for database '%s' due to SQL execution failure: %s",
                db_id,
                exc,
            )
            skipped_count += 1
            continue

        tables_involved = extract_tables_involved(gold_sql)
        if not tables_involved:
            LOGGER.warning(
                "Skipping question for database '%s' because no tables were extracted",
                db_id,
            )
            skipped_count += 1
            continue

        enriched_questions.append(
            {
                "question_text": question_text,
                "database_name": db_id,
                "gold_sql": gold_sql,
                "gold_answer": gold_answer,
                "answer_type": classify_answer_type(gold_answer),
                "difficulty": classify_difficulty(tables_involved),
                "tables_involved": tables_involved,
                "spider_split": raw_question.get("spider_split", "train"),
            }
        )

    # Deterministic ordering before split/id assignment keeps outputs stable.
    split_questions = assign_splits(_sort_enriched_questions(enriched_questions))
    final_questions = _assign_question_ids(split_questions)

    validation_errors = validate_dataset(questions=final_questions, db_paths=db_paths)
    if validation_errors:
        for error in validation_errors:
            print(f"ERROR: {error}")
        raise SystemExit(1)

    # Partition into output files, dropping internal spider_split metadata.
    train_questions: list[dict[str, Any]] = []
    eval_questions: list[dict[str, Any]] = []
    for record in final_questions:
        output_record = {
            key: value for key, value in record.items() if key != "spider_split"
        }
        if output_record["split"] == TRAIN_SPLIT:
            train_questions.append(output_record)
        else:
            eval_questions.append(output_record)

    _write_output(args.train_output, train_questions)
    _write_output(args.eval_output, eval_questions)

    print(f"Prepared {len(db_paths)} databases in {args.output_dir}")
    print(f"Loaded {len(raw_questions)} Spider questions")
    print(f"Curated {len(final_questions)} questions (skipped {skipped_count})")
    print("Validation passed")
    print(f"Wrote {len(train_questions)} train records to {args.train_output}")
    print(f"Wrote {len(eval_questions)} eval records to {args.eval_output}")


if __name__ == "__main__":
    main()
scripts/download_spider_data.py ADDED
@@ -0,0 +1,106 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Script to download Spider dataset questions for specific databases.
3
+
4
+ Usage:
5
+ python download_spider_data.py --db-id student_assessment
6
+ python download_spider_data.py --db-id student_assessment --split validation
7
+ python download_spider_data.py --db-id all # downloads all db_ids
8
+ """
9
+
10
+ import json
11
+ import argparse
12
+ from pathlib import Path
13
+ from datasets import load_dataset
14
+
15
+
16
def download_spider_questions(
    db_id: str = "student_assessment",
    split: str = "train",
    output_dir: str = "data/questions",
) -> None:
    """Download Spider dataset questions for specified database(s).

    Args:
        db_id: Database ID to filter by, or "all" to get all databases
        split: Dataset split ("train" or "validation")
        output_dir: Directory to save JSON files
    """
    target_dir = Path(output_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    print(f"Loading Spider dataset ({split} split)...")
    dataset = load_dataset("xlangai/spider", split=split)

    if db_id.lower() == "all":
        # Bucket every record by its db_id.
        grouped = {}
        for record in dataset:
            grouped.setdefault(record.get("db_id"), []).append(record)

        total_questions = 0
        for name, questions in grouped.items():
            filepath = target_dir / f"{name}.json"
            with open(filepath, "w") as f:
                json.dump(questions, f, indent=2)
            print(f" {name}: {len(questions)} questions → {filepath}")
            total_questions += len(questions)

        print(f"\nTotal: {total_questions} questions across {len(grouped)} databases")
        return

    # Keep only the records for the requested database.
    matching = [record for record in dataset if record.get("db_id") == db_id]
    if not matching:
        print(f"No questions found for db_id='{db_id}'")
        return

    filepath = target_dir / f"{db_id}.json"
    with open(filepath, "w") as f:
        json.dump(matching, f, indent=2)

    print(f"Found {len(matching)} questions for db_id='{db_id}'")
    print(f"Saved to {filepath}")

    # Show the first record (minus any bulky "evidence" field) as a sample.
    sample = matching[0]
    print("\nFirst question sample:")
    print(json.dumps({k: v for k, v in sample.items() if k != "evidence"}, indent=2))
76
+
77
+
78
# CLI wrapper: parse flags and delegate to download_spider_questions().
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Download Spider dataset questions for specific databases",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--db-id",
        type=str,
        default="student_assessment",
        help="Database ID to filter by (or 'all' for all databases)",
    )
    parser.add_argument(
        "--split",
        type=str,
        default="train",
        choices=["train", "validation"],
        help="Dataset split to download",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        default="data/questions",
        help="Directory to save JSON files",
    )

    args = parser.parse_args()
    download_spider_questions(
        db_id=args.db_id, split=args.split, output_dir=args.output_dir
    )
@@ -0,0 +1,301 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Download Spider SQLite databases used by SQLEnv.
2
+
3
+ Uses the same download logic as curate_questions.py: tries GitHub raw URLs
4
+ first, then falls back to the official Google Drive Spider archive.
5
+
6
+ Examples
7
+ --------
8
+ Download the default database (student_assessment):
9
+ uv run python scripts/download_spider_databases.py
10
+
11
+ Download a specific database:
12
+ uv run python scripts/download_spider_databases.py --db-id concert_singer
13
+
14
+ Download all databases referenced in db_list.json:
15
+ uv run python scripts/download_spider_databases.py --db-id all
16
+
17
+ Force re-download:
18
+ uv run python scripts/download_spider_databases.py --force
19
+ """
20
+
21
+ from __future__ import annotations
22
+
23
+ import argparse
24
+ import io
25
+ import json
26
+ import re
27
+ import time
28
+ import zipfile
29
+ from pathlib import Path
30
+ from urllib.error import HTTPError, URLError
31
+ from urllib.request import Request, urlopen
32
+
33
+ SPIDER_RAW_SQLITE_URLS = (
34
+ "https://raw.githubusercontent.com/taoyds/spider/master/database/{db_id}/{db_id}.sqlite",
35
+ "https://github.com/taoyds/spider/raw/master/database/{db_id}/{db_id}.sqlite",
36
+ )
37
+ SPIDER_ARCHIVE_DRIVE_ID = "1403EGqzIDoHMdQF4c9Bkyl7dZLZ5Wt6J"
38
+ SQLITE_MAGIC = b"SQLite format 3\x00"
39
+ DB_LIST_PATH = Path("data/questions/db_list.json")
40
+
41
+
42
+ def _validate_db_id(db_id: str) -> str:
43
+ normalized = db_id.strip()
44
+ if not normalized:
45
+ raise ValueError("db_id cannot be empty")
46
+ if not re.fullmatch(r"[A-Za-z0-9_]+", normalized):
47
+ raise ValueError(
48
+ "Invalid db_id — only letters, numbers, and underscores allowed."
49
+ )
50
+ return normalized
51
+
52
+
53
def _is_valid_sqlite(path: Path) -> bool:
    """Report whether *path* exists and starts with the SQLite magic header."""
    if not path.exists():
        return False
    if path.stat().st_size < 16:
        # Too small to even hold the 16-byte header.
        return False
    with path.open("rb") as handle:
        header = handle.read(16)
    return header == SQLITE_MAGIC
58
+
59
+
60
+ def _safe_sqlite_path(output_dir: Path, db_id: str) -> Path:
61
+ sqlite_path = output_dir / db_id / f"{db_id}.sqlite"
62
+ output_root = output_dir.resolve()
63
+ resolved = sqlite_path.resolve()
64
+ if output_root not in resolved.parents:
65
+ raise ValueError(f"Resolved path escapes output directory: {resolved}")
66
+ return sqlite_path
67
+
68
+
69
def _try_raw_download(db_id: str, destination: Path) -> bool:
    """Try downloading from GitHub raw URLs. Returns True on success."""
    for template in SPIDER_RAW_SQLITE_URLS:
        url = template.format(db_id=db_id)
        try:
            request = Request(url, headers={"User-Agent": "sqlenv/1.0"})
            with urlopen(request, timeout=30) as response:
                payload = response.read()
            if not payload.startswith(SQLITE_MAGIC):
                # Not a SQLite file (e.g. an HTML error page) — try next URL.
                continue
            destination.parent.mkdir(parents=True, exist_ok=True)
            # Write to a temp file, then rename for an atomic replace.
            tmp_path = destination.with_suffix(".tmp")
            tmp_path.write_bytes(payload)
            tmp_path.replace(destination)
            return True
        except (HTTPError, URLError, OSError):
            continue
    return False
87
+
88
+
89
def _download_drive_archive() -> bytes:
    """Download official Spider archive from Google Drive.

    Handles Drive's large-file virus-scan interstitial by extracting the
    confirm token and retrying against the usercontent download endpoint.
    Makes up to two attempts with a short pause between them.

    Returns:
        The raw zip archive bytes.

    Raises:
        RuntimeError: If no attempt yields a zip payload.
    """
    drive_url = (
        f"https://drive.google.com/uc?export=download&id={SPIDER_ARCHIVE_DRIVE_ID}"
    )
    req = Request(drive_url, headers={"User-Agent": "sqlenv/1.0"})

    for attempt in range(2):
        try:
            with urlopen(req, timeout=120) as resp:
                payload = resp.read()

            # Zip files start with the "PK" local-file-header signature.
            if payload.startswith(b"PK"):
                return payload

            # Google Drive virus-scan warning page — parse confirm token
            text = payload.decode("utf-8", errors="replace")
            confirm_match = re.search(r'name="confirm" value="([^"]+)"', text)
            if confirm_match:
                confirm_url = (
                    "https://drive.usercontent.google.com/download"
                    f"?id={SPIDER_ARCHIVE_DRIVE_ID}"
                    f"&export=download&confirm={confirm_match.group(1)}"
                )
                confirm_req = Request(
                    confirm_url,
                    headers={"User-Agent": "sqlenv/1.0"},
                )
                # Longer timeout: the confirmed URL streams the full archive.
                with urlopen(confirm_req, timeout=240) as resp2:
                    payload = resp2.read()
                if payload.startswith(b"PK"):
                    return payload

            raise RuntimeError("Drive response was not a zip file")
        except (HTTPError, URLError, OSError, RuntimeError):
            # Brief backoff before the second (final) attempt.
            if attempt == 0:
                time.sleep(3)

    raise RuntimeError(
        "Failed to download Spider archive from Google Drive after retries"
    )
+ )
130
+
131
+
132
def _extract_from_archive(archive_bytes: bytes, db_id: str, destination: Path) -> None:
    """Extract a single database from the Spider zip archive.

    Tries the known top-level archive layouts in order and writes the
    first member that carries the SQLite magic header.

    Raises:
        FileNotFoundError: If no candidate member exists in the archive.
    """
    member_candidates = (
        f"spider_data/database/{db_id}/{db_id}.sqlite",
        f"spider/database/{db_id}/{db_id}.sqlite",
        f"spider-master/database/{db_id}/{db_id}.sqlite",
    )
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for member in member_candidates:
            try:
                payload = archive.read(member)
            except KeyError:
                # Member not present under this layout — try the next one.
                continue
            if not payload.startswith(SQLITE_MAGIC):
                continue
            destination.parent.mkdir(parents=True, exist_ok=True)
            # Write via temp file, then atomic rename.
            tmp_path = destination.with_suffix(".tmp")
            tmp_path.write_bytes(payload)
            tmp_path.replace(destination)
            return
    raise FileNotFoundError(f"Database '{db_id}' not found in Spider archive")
152
+
153
+
154
def _extract_all_from_archive(
    archive_bytes: bytes, output_dir: Path, force: bool
) -> int:
    """Extract all databases from the Spider archive.

    Returns:
        The number of database files written.
    """
    extracted = 0
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for member in archive.namelist():
            # Only .sqlite files below a database/ directory qualify.
            if not member.endswith(".sqlite"):
                continue
            if "/database/" not in member:
                continue
            db_name = Path(member).stem
            target = output_dir / db_name / f"{db_name}.sqlite"
            if target.exists() and not force:
                continue
            payload = archive.read(member)
            if not payload.startswith(SQLITE_MAGIC):
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            # Temp-file write plus rename keeps the target atomic.
            tmp_path = target.with_suffix(".tmp")
            tmp_path.write_bytes(payload)
            tmp_path.replace(target)
            extracted += 1
    return extracted
178
+
179
+
180
def download_database(db_id: str, output_dir: Path, force: bool = False) -> Path:
    """Download one Spider database, with Google Drive fallback.

    Returns:
        Path to the local SQLite file.
    """
    normalized = _validate_db_id(db_id)
    sqlite_path = _safe_sqlite_path(output_dir, normalized)

    # Skip the download when a valid copy is already on disk.
    if _is_valid_sqlite(sqlite_path) and not force:
        print(f"Already exists: {sqlite_path}")
        return sqlite_path

    print(f"Downloading {normalized}...")

    if _try_raw_download(normalized, sqlite_path):
        print(f" -> {sqlite_path} (from GitHub)")
        return sqlite_path

    # Fall back to the full Drive archive when raw URLs fail.
    print(" GitHub raw URLs failed, trying Google Drive archive...")
    archive_bytes = _download_drive_archive()
    _extract_from_archive(archive_bytes, normalized, sqlite_path)
    print(f" -> {sqlite_path} (from Drive archive)")
    return sqlite_path
200
+
201
+
202
def download_all(output_dir: Path, force: bool = False) -> int:
    """Download all databases from Google Drive archive.

    Returns:
        The number of databases extracted.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    print("Downloading Spider archive from Google Drive...")
    archive_bytes = _download_drive_archive()
    extracted = _extract_all_from_archive(archive_bytes, output_dir, force)
    print(f"Extracted {extracted} database(s) to {output_dir}")
    return extracted
210
+
211
+
212
def download_listed(output_dir: Path, force: bool = False) -> int:
    """Download databases listed in db_list.json.

    Strategy: try the fast per-file GitHub raw download for each listed
    database first, then fetch the Drive archive once and extract any
    databases that failed.

    Args:
        output_dir: Directory receiving ``<db>/<db>.sqlite`` files.
        force: Re-download databases that already exist locally.

    Returns:
        The number of valid databases present on disk afterwards.

    Raises:
        FileNotFoundError: If db_list.json does not exist.
    """
    if not DB_LIST_PATH.exists():
        raise FileNotFoundError(
            f"{DB_LIST_PATH} not found — run curate_questions.py first "
            "or use --db-id <name> to download individual databases"
        )
    db_ids = json.loads(DB_LIST_PATH.read_text())
    print(f"Downloading {len(db_ids)} databases from db_list.json...")

    # Try GitHub raw first, batch fallback to archive for failures
    remaining = []
    for db_id in db_ids:
        normalized = _validate_db_id(db_id)
        sqlite_path = _safe_sqlite_path(output_dir, normalized)
        if _is_valid_sqlite(sqlite_path) and not force:
            print(f" Already exists: {normalized}")
            continue
        if _try_raw_download(normalized, sqlite_path):
            print(f" Downloaded: {normalized} (GitHub)")
        else:
            remaining.append(normalized)

    if remaining:
        # Single archive download covers every database GitHub could not serve.
        print(
            f" {len(remaining)} failed from GitHub, falling back to Drive archive..."
        )
        archive_bytes = _download_drive_archive()
        for db_id in remaining:
            sqlite_path = _safe_sqlite_path(output_dir, db_id)
            try:
                _extract_from_archive(archive_bytes, db_id, sqlite_path)
                print(f" Downloaded: {db_id} (Drive archive)")
            except FileNotFoundError:
                # Best effort: report and continue with the other databases.
                print(f" FAILED: {db_id} not found in archive")

    # Recount from disk so the summary reflects actual valid files.
    downloaded = sum(
        1
        for db_id in db_ids
        if _is_valid_sqlite(output_dir / db_id / f"{db_id}.sqlite")
    )
    print(f"Ready: {downloaded}/{len(db_ids)} databases in {output_dir}")
    return downloaded
255
+
256
+
257
def parse_args() -> argparse.Namespace:
    """Parse CLI arguments for the Spider database download script."""
    parser = argparse.ArgumentParser(
        description="Download Spider SQLite databases for SQLEnv",
    )
    db_id_help = (
        "Spider database ID to download. "
        "Use 'all' for every Spider DB, or omit to download "
        "databases listed in data/questions/db_list.json"
    )
    parser.add_argument("--db-id", type=str, default=None, help=db_id_help)
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path("data/databases"),
        help="Directory to store databases (default: data/databases)",
    )
    parser.add_argument(
        "--force",
        action="store_true",
        help="Overwrite existing files",
    )
    return parser.parse_args()
283
+
284
+
285
def main() -> None:
    """Dispatch to the requested download mode based on --db-id."""
    args = parse_args()

    # No --db-id: pull everything listed in db_list.json.
    if args.db_id is None:
        download_listed(output_dir=args.output_dir, force=args.force)
        return

    # --db-id all: extract every database from the Drive archive.
    if args.db_id.lower() == "all":
        download_all(output_dir=args.output_dir, force=args.force)
        return

    # Otherwise fetch just the one named database.
    download_database(
        db_id=args.db_id,
        output_dir=args.output_dir,
        force=args.force,
    )


if __name__ == "__main__":
    main()
scripts/generate_models_from_schema.py ADDED
@@ -0,0 +1,294 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Script to download Spider schema and auto-generate SQLAlchemy models.
3
+
4
+ The spider-schema dataset contains detailed database schemas including
5
+ table names, column names, types, and relationships. This script
6
+ downloads the schema and generates SQLAlchemy ORM models.
7
+
8
+ Usage:
9
+ # Generate models for student_assessment database
10
+ python generate_models_from_schema.py --db-id student_assessment
11
+
12
+ # Generate for multiple databases
13
+ python generate_models_from_schema.py --db-id all --output-dir models/
14
+
15
+ # Load from validation split
16
+ python generate_models_from_schema.py --db-id student_assessment --split validation
17
+ """
18
+
19
+ import json
20
+ import argparse
21
+ from pathlib import Path
22
+ from typing import Any, Dict, List, Optional
23
+ from datasets import load_dataset
24
+
25
+
26
+ # Type mapping from Spider schema to SQLAlchemy
27
+ SQLALCHEMY_TYPE_MAP = {
28
+ "number": "Integer",
29
+ "int": "Integer",
30
+ "float": "Float",
31
+ "text": "String",
32
+ "string": "String",
33
+ "varchar": "String",
34
+ "char": "String",
35
+ "date": "Date",
36
+ "datetime": "DateTime",
37
+ "timestamp": "DateTime",
38
+ "time": "DateTime",
39
+ "boolean": "Boolean",
40
+ "bool": "Boolean",
41
+ }
42
+
43
+
44
+ def get_sqlalchemy_type(col_type: str) -> str:
45
+ """Convert Spider schema type to SQLAlchemy type."""
46
+ col_type_lower = col_type.lower().strip()
47
+
48
+ # Exact match
49
+ if col_type_lower in SQLALCHEMY_TYPE_MAP:
50
+ return SQLALCHEMY_TYPE_MAP[col_type_lower]
51
+
52
+ # Substring match (e.g., "varchar(255)" -> "String")
53
+ for key, sa_type in SQLALCHEMY_TYPE_MAP.items():
54
+ if key in col_type_lower:
55
+ return sa_type
56
+
57
+ # Default to String
58
+ return "String"
59
+
60
+
61
+ def generate_model_code(
62
+ db_id: str,
63
+ tables: List[Dict[str, Any]],
64
+ schema: Dict[str, Any],
65
+ ) -> str:
66
+ """Generate SQLAlchemy model code from schema.
67
+
68
+ Args:
69
+ db_id: Database ID
70
+ tables: List of table schemas
71
+ schema: Full schema dictionary with relationships
72
+
73
+ Returns:
74
+ Generated Python code as string
75
+ """
76
+ lines = [
77
+ f'"""',
78
+ f"SQLAlchemy ORM models for '{db_id}' database.",
79
+ f'",
80
+ f"Auto-generated from Spider schema dataset.",
81
+ f'"""',
82
+ f"",
83
+ f"from datetime import datetime",
84
+ f"from sqlalchemy import Column, Integer, String, Float, Date, DateTime, Boolean, ForeignKey",
85
+ f"from sqlalchemy.ext.declarative import declarative_base",
86
+ f"from sqlalchemy.orm import relationship",
87
+ f"",
88
+ f"Base = declarative_base()",
89
+ f"",
90
+ ]
91
+
92
+ # Generate model for each table
93
+ table_names = [t["name"] for t in tables]
94
+
95
+ for table in tables:
96
+ table_name = table["name"]
97
+ class_name = "".join(word.capitalize() for word in table_name.split("_"))
98
+
99
+ lines.append(f'class {class_name}(Base):')
100
+ lines.append(f' """Model for {table_name} table."""')
101
+ lines.append(f' __tablename__ = "{table_name}"')
102
+ lines.append(f"")
103
+
104
+ # Add columns
105
+ columns = table.get("columns", [])
106
+ for col in columns:
107
+ col_name = col["name"]
108
+ col_type = col.get("type", "text")
109
+ sa_type = get_sqlalchemy_type(col_type)
110
+
111
+ # Determine if primary key
112
+ is_pk = col.get("is_primary_key", False)
113
+
114
+ # Determine if foreign key
115
+ fk_str = ""
116
+ for fk in schema.get("foreign_keys", []):
117
+ if fk[0] == (table_names.index(table_name), columns.index(col)):
118
+ source_table_idx, target_table_idx = fk
119
+ target_col_idx = fk[2] if len(fk) > 2 else 0
120
+ target_table = table_names[target_table_idx]
121
+ target_col = tables[target_table_idx]["columns"][target_col_idx]["name"]
122
+ fk_str = f', ForeignKey("{target_table}.{target_col}")'
123
+
124
+ # Default nullable to False for primary keys
125
+ nullable = "False" if is_pk else "True"
126
+ pk_str = ", primary_key=True" if is_pk else ""
127
+
128
+ lines.append(
129
+ f' {col_name} = Column({sa_type}({col_type.split("(")[1].rstrip(")")} '
130
+ f'if "{sa_type}" == "String" else ""){pk_str}{fk_str}, nullable={nullable})'
131
+ )
132
+
133
+ lines.append(f"")
134
+
135
+ return "\n".join(lines)
136
+
137
+
138
+ def download_schema_and_generate_models(
139
+ db_id: str = "student_assessment",
140
+ split: str = "train",
141
+ output_dir: str = "data/models",
142
+ ) -> None:
143
+ """Download Spider schema and generate SQLAlchemy models.
144
+
145
+ Args:
146
+ db_id: Database ID to download schema for
147
+ split: Dataset split ("train" or "validation")
148
+ output_dir: Directory to save generated model files
149
+ """
150
+ output_path = Path(output_dir)
151
+ output_path.mkdir(parents=True, exist_ok=True)
152
+
153
+ print(f"Loading Spider schema dataset ({split} split)...")
154
+ dataset = load_dataset("richardr1126/spider-schema", split=split)
155
+
156
+ if db_id.lower() == "all":
157
+ # Generate models for all databases
158
+ processed = set()
159
+ for item in dataset:
160
+ current_db_id = item.get("db_id")
161
+ if current_db_id in processed:
162
+ continue
163
+ processed.add(current_db_id)
164
+
165
+ tables = item.get("table", [])
166
+ schema = {
167
+ "table_names": [t["name"] for t in tables],
168
+ "column_names": [col for t in tables for col in t.get("columns", [])],
169
+ "foreign_keys": item.get("foreign_keys", []),
170
+ }
171
+
172
+ # Generate code (simplified)
173
+ code = generate_simplified_models(current_db_id, tables)
174
+
175
+ filepath = output_path / f"{current_db_id}.py"
176
+ with open(filepath, "w") as f:
177
+ f.write(code)
178
+
179
+ print(f" {current_db_id}: {len(tables)} tables → {filepath}")
180
+ else:
181
+ # Filter for specific db_id
182
+ matching = [item for item in dataset if item.get("db_id") == db_id]
183
+
184
+ if not matching:
185
+ print(f"No schema found for db_id='{db_id}'")
186
+ return
187
+
188
+ item = matching[0]
189
+ tables = item.get("table", [])
190
+
191
+ # Generate simplified model code
192
+ code = generate_simplified_models(db_id, tables)
193
+
194
+ filepath = output_path / f"{db_id}.py"
195
+ with open(filepath, "w") as f:
196
+ f.write(code)
197
+
198
+ print(f"Found schema for db_id='{db_id}' with {len(tables)} tables")
199
+ print(f"Generated models → {filepath}")
200
+ print(f"\nTables: {', '.join(t['name'] for t in tables)}")
201
+
202
+
203
+ def generate_simplified_models(db_id: str, tables: List[Dict[str, Any]]) -> str:
204
+ """Generate SQLAlchemy models from table schema (simplified version).
205
+
206
+ Args:
207
+ db_id: Database ID
208
+ tables: List of table definitions from schema
209
+
210
+ Returns:
211
+ Generated Python code
212
+ """
213
+ lines = [
214
+ f'"""',
215
+ f"SQLAlchemy ORM models for '{db_id}' database.",
216
+ f'",
217
+ f"Auto-generated from Spider schema dataset.",
218
+ f'"""',
219
+ f"",
220
+ f"from datetime import datetime",
221
+ f"from sqlalchemy import Column, Integer, String, Float, Date, DateTime, Boolean, ForeignKey",
222
+ f"from sqlalchemy.ext.declarative import declarative_base",
223
+ f"from sqlalchemy.orm import relationship",
224
+ f"",
225
+ f"Base = declarative_base()",
226
+ f"",
227
+ ]
228
+
229
+ for table in tables:
230
+ table_name = table.get("name", "Unknown")
231
+ class_name = "".join(word.capitalize() for word in table_name.split("_"))
232
+
233
+ lines.append(f"")
234
+ lines.append(f"class {class_name}(Base):")
235
+ lines.append(f' """Model for {table_name} table."""')
236
+ lines.append(f' __tablename__ = "{table_name}"')
237
+ lines.append(f"")
238
+
239
+ # Add columns
240
+ columns = table.get("columns", [])
241
+ if columns:
242
+ for col in columns:
243
+ col_name = col.get("name", "unknown")
244
+ col_type = col.get("type", "text")
245
+ sa_type = get_sqlalchemy_type(col_type)
246
+
247
+ # Determine string length from type if specified
248
+ length_spec = ""
249
+ if sa_type == "String":
250
+ if "(" in col_type and ")" in col_type:
251
+ length = col_type.split("(")[1].split(")")[0]
252
+ if length.isdigit():
253
+ length_spec = f"({length})"
254
+ else:
255
+ length_spec = "(255)" # default
256
+
257
+ lines.append(f' {col_name} = Column({sa_type}{length_spec}, nullable=True)')
258
+ else:
259
+ lines.append(f" id = Column(Integer, primary_key=True)")
260
+
261
+ lines.append(f"")
262
+
263
+ return "\n".join(lines)
264
+
265
+
266
+ if __name__ == "__main__":
267
+ parser = argparse.ArgumentParser(
268
+ description="Download Spider schema and generate SQLAlchemy models",
269
+ formatter_class=argparse.RawDescriptionHelpFormatter,
270
+ )
271
+ parser.add_argument(
272
+ "--db-id",
273
+ type=str,
274
+ default="student_assessment",
275
+ help="Database ID to generate models for (or 'all' for all databases)",
276
+ )
277
+ parser.add_argument(
278
+ "--split",
279
+ type=str,
280
+ default="train",
281
+ choices=["train", "validation"],
282
+ help="Schema dataset split to use",
283
+ )
284
+ parser.add_argument(
285
+ "--output-dir",
286
+ type=str,
287
+ default="data/models",
288
+ help="Directory to save generated model files",
289
+ )
290
+
291
+ args = parser.parse_args()
292
+ download_schema_and_generate_models(
293
+ db_id=args.db_id, split=args.split, output_dir=args.output_dir
294
+ )
server/__init__.py ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ """SQLEnv server components."""
2
+
3
+ from .sql_environment import SQLEnvironment
4
+
5
+ __all__ = ["SQLEnvironment"]
server/app.py ADDED
@@ -0,0 +1,110 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ FastAPI application for the SQLEnv environment.
3
+
4
+ Exposes the SQLEnvironment over HTTP and WebSocket endpoints,
5
+ compatible with the OpenEnv EnvClient.
6
+
7
+ Usage:
8
+ # Development (with auto-reload):
9
+ uv run uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
10
+
11
+ # Via uv:
12
+ uv run server
13
+ """
14
+
15
+ import os
16
+ from pathlib import Path
17
+
18
+ # Load environment variables from .env file
19
+ try:
20
+ from dotenv import load_dotenv
21
+
22
+ env_file = Path(__file__).parent.parent / ".env"
23
+ if env_file.exists():
24
+ load_dotenv(env_file)
25
+ except ImportError:
26
+ pass # python-dotenv not installed, use system env vars
27
+
28
+ from openenv.core.env_server import create_app
29
+
30
+ try:
31
+ from sql_env.models import SQLAction, SQLObservation
32
+ from sql_env.server.sql_environment import SQLEnvironment
33
+ except ImportError:
34
+ # Fallback for Docker where PYTHONPATH=/app/env
35
+ from models import SQLAction, SQLObservation # type: ignore[no-redef]
36
+ from server.sql_environment import SQLEnvironment # type: ignore[no-redef]
37
+
38
+
39
def get_tokenizer():
    """Return a tokenizer for the environment.

    Reads TOKENIZER_NAME from the environment (default: Mistral-7B-Instruct)
    and falls back to the test MockTokenizer when transformers cannot be
    imported.
    """
    tokenizer_name = os.environ.get(
        "TOKENIZER_NAME", "mistralai/Mistral-7B-Instruct-v0.1"
    )

    try:
        # Keep from_pretrained inside the try: transformers may raise
        # ImportError for missing backends during loading as well.
        from transformers import AutoTokenizer

        loaded = AutoTokenizer.from_pretrained(tokenizer_name)
        print(f"Loaded tokenizer: {tokenizer_name}")
        return loaded
    except ImportError:
        print(
            "Warning: transformers not installed, using mock tokenizer for testing only"
        )
        from server.test_sql_env import MockTokenizer

        return MockTokenizer()
58
+
59
+
60
def create_sql_environment():
    """Factory that builds a SQLEnvironment with tokenizer and data paths.

    QUESTIONS_PATH and DB_DIR environment variables override the
    repo-relative defaults.
    """
    tokenizer = get_tokenizer()

    repo_root = Path(__file__).parent.parent
    default_questions = repo_root / "data" / "questions" / "student_assessment.json"
    default_db_dir = repo_root / "data" / "databases"

    questions_path = os.environ.get("QUESTIONS_PATH", str(default_questions))
    db_dir = os.environ.get("DB_DIR", str(default_db_dir))

    return SQLEnvironment(
        questions_path=questions_path,
        db_dir=db_dir,
        tokenizer=tokenizer,
    )
81
+
82
+
83
# Create the FastAPI app
# create_app (from openenv) wires the environment factory plus the action and
# observation models into HTTP/WebSocket endpoints compatible with EnvClient.
app = create_app(
    create_sql_environment,
    SQLAction,
    SQLObservation,
    env_name="sql_env",
)
90
+
91
+
92
def main(host: str = "0.0.0.0", port: int = 8000):
    """Entry point for running the server directly.

    Enables:
        uv run server
        python -m sql_env.server.app
    """
    # Imported lazily so importing this module never requires uvicorn.
    import uvicorn

    uvicorn.run(app, host=host, port=port)


if __name__ == "__main__":
    import argparse

    cli = argparse.ArgumentParser()
    cli.add_argument("--port", type=int, default=8000)
    main(port=cli.parse_args().port)
server/install_deps.sh ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/bin/bash
2
+ # Additional setup for sql_env
3
+ set -e
4
+
5
+ # Install Python dependencies
6
+ pip install --no-cache-dir -r /tmp/requirements.txt
7
+
8
+ # Set up cache directory for Hugging Face models
9
+ mkdir -p /.cache && chmod 777 /.cache
10
+
11
+ # Pre-download the GPT-2 model to avoid permission issues during runtime
12
+ python -c "from transformers import GPT2Tokenizer; GPT2Tokenizer.from_pretrained('gpt2')"
server/requirements.txt ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ fastapi>=0.104.0
2
+ openenv-core @ git+https://github.com/meta-pytorch/OpenEnv.git
3
+ pydantic>=2.0.0
4
+ torch==2.2.2
5
+ transformers
6
+ uvicorn>=0.24.0
server/reward.py ADDED
@@ -0,0 +1,185 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Reward helpers for SQLEnv dense shaping."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import hashlib
6
+ import math
7
+
8
+ try:
9
+ from sql_env.models import EpisodeContext
10
+ except ImportError: # pragma: no cover - Docker fallback import path
11
+ from models import EpisodeContext # type: ignore[no-redef]
12
+
13
+
14
+ _EXEC_OK_REWARD = 0.02
15
+ _NEW_INFO_REWARD = 0.01
16
+ _NEW_INFO_CAP = 0.10
17
+ _REPEAT_PENALTY = 0.01
18
+ _STEP_COST = 0.005
19
+ _LAYER2_CARDINALITY_WEIGHT = 0.25
20
+ _LAYER2_VALUE_OVERLAP_WEIGHT = 0.50
21
+ _LAYER2_NUMERIC_RANGE_WEIGHT = 0.25
22
+ _LAYER2_IMPROVEMENT_SCALE = 0.15
23
+ _STEP_REWARD_FLOOR = -0.2
24
+ _STEP_REWARD_CAP = 0.5
25
+
26
+
27
def compute_step_reward(
    ctx: EpisodeContext,
    action_type: str,
    sql: str,
    rows: list[tuple] | None,
    error: str | None,
) -> float:
    """Compute one dense step reward and clamp cumulative episode shaping.

    Layer 1 operational shaping always applies; Layer 2 progress shaping is
    added only for successful QUERY actions. The cumulative shaping total is
    kept within ``[-0.2, 0.5]`` and only the clamped delta for this step is
    returned.
    """
    reward = _layer1_operational(ctx, action_type, sql, rows, error)

    is_successful_query = (
        action_type.upper() == "QUERY" and rows is not None and error is None
    )
    if is_successful_query:
        reward += _layer2_progress(ctx, rows)

    previous_total = ctx.cumulative_step_reward
    new_total = max(
        _STEP_REWARD_FLOOR, min(_STEP_REWARD_CAP, previous_total + reward)
    )
    ctx.cumulative_step_reward = new_total

    return new_total - previous_total
52
+
53
+
54
def _layer1_operational(
    ctx: EpisodeContext,
    action_type: str,
    sql: str,
    rows: list[tuple] | None,
    error: str | None,
) -> float:
    """Compute Layer 1 operational reward signals.

    Components:
    - ``-0.005`` step cost on every call
    - ``-0.01`` repeat penalty for previously-seen QUERY SQL
    - ``+0.02`` for successful execution (``error is None``)
    - ``+0.01`` new-info bonus for a first-seen successful QUERY,
      capped at ``0.10`` cumulative per episode
    """
    reward = -_STEP_COST

    is_query = action_type.upper() == "QUERY"
    sql_digest = (
        hashlib.sha256(sql.encode("utf-8")).hexdigest() if is_query and sql else None
    )
    seen_before = sql_digest is not None and sql_digest in ctx.query_hashes

    if seen_before:
        reward -= _REPEAT_PENALTY
    elif error is None:
        reward += _EXEC_OK_REWARD

    first_successful_query = (
        sql_digest is not None
        and not seen_before
        and error is None
        and rows is not None
    )
    if first_successful_query:
        ctx.query_hashes.add(sql_digest)
        headroom = _NEW_INFO_CAP - ctx.cumulative_new_info_reward
        if headroom > 0:
            bonus = min(_NEW_INFO_REWARD, headroom)
            ctx.cumulative_new_info_reward += bonus
            reward += bonus

    return reward
100
+
101
+
102
+ def _cardinality_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
103
+ """Compute row-count similarity score in [0.0, 1.0]."""
104
+
105
+ pred_count = len(pred_rows)
106
+ gold_count = len(gold_rows)
107
+ denominator = max(pred_count, gold_count, 1)
108
+ score = 1.0 - (abs(pred_count - gold_count) / denominator)
109
+ return max(0.0, min(1.0, score))
110
+
111
+
112
+ def _value_overlap_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
113
+ """Compute Jaccard overlap of flattened cell values as strings."""
114
+
115
+ pred_values = {str(cell) for row in pred_rows for cell in row}
116
+ gold_values = {str(cell) for row in gold_rows for cell in row}
117
+
118
+ union = pred_values | gold_values
119
+ if not union:
120
+ return 0.0
121
+
122
+ intersection = pred_values & gold_values
123
+ return len(intersection) / len(union)
124
+
125
+
126
+ def _numeric_range_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
127
+ """Compute log-distance proximity for numeric cell values."""
128
+
129
+ def _is_numeric(value: object) -> bool:
130
+ return isinstance(value, (int, float)) and not isinstance(value, bool)
131
+
132
+ pred_numerics = [float(cell) for row in pred_rows for cell in row if _is_numeric(cell)]
133
+ gold_numerics = [float(cell) for row in gold_rows for cell in row if _is_numeric(cell)]
134
+
135
+ if not gold_numerics:
136
+ return 1.0
137
+ if not pred_numerics:
138
+ return 0.0
139
+
140
+ total = 0.0
141
+ for gold_value in gold_numerics:
142
+ closest_distance = min(abs(pred_value - gold_value) for pred_value in pred_numerics)
143
+ total += 1.0 / (1.0 + math.log1p(closest_distance))
144
+
145
+ return total / len(gold_numerics)
146
+
147
+
148
+ def _bin_progress(raw_score: float) -> float:
149
+ """Bin raw progress to one of {0.0, 0.25, 0.5, 0.75, 1.0}."""
150
+
151
+ clamped_score = max(0.0, min(1.0, raw_score))
152
+ if clamped_score < 0.125:
153
+ return 0.0
154
+ if clamped_score < 0.375:
155
+ return 0.25
156
+ if clamped_score < 0.625:
157
+ return 0.5
158
+ if clamped_score < 0.875:
159
+ return 0.75
160
+ return 1.0
161
+
162
+
163
def _layer2_progress(ctx: EpisodeContext, rows: list[tuple]) -> float:
    """Layer 2 progress reward; pays only for improvement over the best bin."""
    if not ctx.gold_rows:
        return 0.0

    raw_progress = (
        _LAYER2_CARDINALITY_WEIGHT * _cardinality_score(rows, ctx.gold_rows)
        + _LAYER2_VALUE_OVERLAP_WEIGHT * _value_overlap_score(rows, ctx.gold_rows)
        + _LAYER2_NUMERIC_RANGE_WEIGHT * _numeric_range_score(rows, ctx.gold_rows)
    )
    binned = _bin_progress(raw_progress)

    improvement = binned - ctx.best_progress
    if improvement <= 0:
        # Never pay twice for the same (or worse) progress level.
        return 0.0

    ctx.best_progress = binned
    return improvement * _LAYER2_IMPROVEMENT_SCALE
server/sql_environment.py ADDED
@@ -0,0 +1,635 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ import logging
3
+ from pathlib import Path
4
+ import random
5
+ import re
6
+ import sqlite3
7
+ import time
8
+ import uuid
9
+
10
+ from openenv.core.env_server.interfaces import Environment, Message, ModelTokenizer, Transform
11
+
12
+ from .reward import compute_step_reward
13
+ from .verifier import verify_answer
14
+
15
+ try:
16
+ from sql_env.models import EpisodeContext, QuestionRecord, SQLAction, SQLObservation, SQLState
17
+ except ImportError:
18
+ # Fallback for Docker where PYTHONPATH=/app/env
19
+ from models import ( # type: ignore[no-redef]
20
+ EpisodeContext,
21
+ QuestionRecord,
22
+ SQLAction,
23
+ SQLObservation,
24
+ SQLState,
25
+ )
26
+
27
+ logger = logging.getLogger(__name__)
28
+
29
+ _TABLE_FROM_JOIN_PATTERN = re.compile(
30
+ r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE
31
+ )
32
+ _FIRST_KEYWORD_PATTERN = re.compile(r"^[\s\n\r\t]*(\w+)")
33
+
34
+
35
+ class SQLEnvironment(Environment[SQLAction, SQLObservation, SQLState]):
36
+ """SQLEnv server implementation with a structured SQL action loop."""
37
+
38
    def __init__(
        self,
        questions_path: str,
        db_dir: str,
        tokenizer: ModelTokenizer,
        step_budget: int = 15,
        transform: Transform | None = None,
    ):
        """Initialize the environment.

        Args:
            questions_path: Path to a Spider-style questions JSON file.
            db_dir: Directory containing the SQLite databases.
            tokenizer: Tokenizer exposing ``apply_chat_template``.
            step_budget: Maximum steps per episode (must be > 0).
            transform: Optional transform forwarded to the base Environment.

        Raises:
            ValueError: If the tokenizer lacks ``apply_chat_template``,
                ``step_budget`` is not positive, or the questions file is empty.
            FileNotFoundError: If the questions file or DB directory is missing.
        """
        super().__init__(transform=transform)

        # Fail fast on configuration errors before any episode starts.
        if not hasattr(tokenizer, "apply_chat_template"):
            raise ValueError("Tokenizer must have 'apply_chat_template' method")
        if step_budget <= 0:
            raise ValueError("step_budget must be a positive integer")

        questions_file = Path(questions_path)
        database_dir = Path(db_dir)
        if not questions_file.exists():
            raise FileNotFoundError(f"Questions file not found: {questions_file}")
        if not database_dir.exists() or not database_dir.is_dir():
            raise FileNotFoundError(f"Database directory not found: {database_dir}")

        self.tokenizer = tokenizer
        self.questions_path = questions_file
        self.db_dir = database_dir
        self.step_budget = step_budget
        self.questions = self._load_questions(str(questions_file))

        if not self.questions:
            raise ValueError("Questions file contains no questions")

        # Per-episode state; populated by reset().
        self._episode: EpisodeContext | None = None
        self._last_result = ""
        self._last_error = ""
        self._last_reward: float | None = None
        self._last_query_truncated = False

        self._state = SQLState()
76
+
77
+ def _extract_tables_from_sql(self, sql: str) -> list[str]:
78
+ """Extract table names from basic FROM/JOIN clauses."""
79
+ tables: list[str] = []
80
+ for match in _TABLE_FROM_JOIN_PATTERN.findall(sql):
81
+ if match not in tables:
82
+ tables.append(match)
83
+ return tables
84
+
85
    def _load_questions(self, path: str) -> list[QuestionRecord]:
        """Load Spider questions JSON into QuestionRecord instances.

        The file must be a JSON array of objects, each with non-empty
        string fields ``question``, ``db_id`` (identifier-only characters),
        and ``query`` (the gold SQL).

        Raises:
            FileNotFoundError: If the file does not exist.
            ValueError: If the JSON is malformed or any record is invalid.
        """
        questions_path = Path(path)
        if not questions_path.exists():
            raise FileNotFoundError(f"Questions file not found: {questions_path}")

        try:
            with questions_path.open("r", encoding="utf-8") as handle:
                payload = json.load(handle)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid questions JSON format: {questions_path}") from exc

        if not isinstance(payload, list):
            raise ValueError("Questions JSON must be an array of records")

        question_records: list[QuestionRecord] = []
        for idx, item in enumerate(payload):
            if not isinstance(item, dict):
                raise ValueError(f"Question at index {idx} must be an object")

            question_text = item.get("question")
            db_name = item.get("db_id")
            gold_sql = item.get("query")

            if not isinstance(question_text, str) or not question_text.strip():
                raise ValueError(f"Question at index {idx} missing non-empty 'question'")
            if not isinstance(db_name, str) or not db_name.strip():
                raise ValueError(f"Question at index {idx} missing non-empty 'db_id'")
            if not isinstance(gold_sql, str) or not gold_sql.strip():
                raise ValueError(f"Question at index {idx} missing non-empty 'query'")

            # Restrict db_id to identifier characters: it is later used to
            # build a filesystem path in _open_db.
            normalized_db_name = db_name.strip()
            if not re.fullmatch(r"[A-Za-z0-9_]+", normalized_db_name):
                raise ValueError(
                    f"Question at index {idx} has invalid db_id '{normalized_db_name}'"
                )

            # gold_answer is left empty here; it is computed per-episode in
            # reset() by executing the gold SQL against the live database.
            question_records.append(
                QuestionRecord(
                    question_id=f"q-{idx}",
                    question_text=question_text,
                    database_name=normalized_db_name,
                    gold_sql=gold_sql,
                    gold_answer="",
                    answer_type="string",
                    difficulty="medium",
                    tables_involved=self._extract_tables_from_sql(gold_sql),
                )
            )

        return question_records
136
+
137
+ def _open_db(self, db_name: str) -> sqlite3.Connection:
138
+ """Open a read-only SQLite connection for the requested database."""
139
+ normalized_db_name = db_name.strip()
140
+ if not re.fullmatch(r"[A-Za-z0-9_]+", normalized_db_name):
141
+ raise ValueError(f"Invalid database name: '{db_name}'")
142
+
143
+ candidates = [
144
+ (self.db_dir / normalized_db_name / f"{normalized_db_name}.sqlite").resolve(),
145
+ (self.db_dir / f"{normalized_db_name}.sqlite").resolve(),
146
+ ]
147
+
148
+ db_root = self.db_dir.resolve()
149
+ db_path = next(
150
+ (
151
+ candidate
152
+ for candidate in candidates
153
+ if candidate.exists() and db_root in candidate.parents
154
+ ),
155
+ None,
156
+ )
157
+ if db_path is None:
158
+ raise FileNotFoundError(
159
+ f"Database '{normalized_db_name}' not found in {self.db_dir}"
160
+ )
161
+
162
+ uri = f"file:{db_path}?mode=ro"
163
+ return sqlite3.connect(uri, uri=True)
164
+
165
+ def _format_gold_answer(self, rows: list[tuple]) -> str:
166
+ """Convert SQL rows into a stable string answer for episode comparison."""
167
+ if not rows:
168
+ return ""
169
+ if len(rows) == 1 and len(rows[0]) == 1:
170
+ return str(rows[0][0])
171
+ return "\n".join(" | ".join(str(value) for value in row) for row in rows)
172
+
173
    def _execute_gold_sql(
        self,
        connection: sqlite3.Connection,
        sql: str,
        timeout_s: float = 5.0,
    ) -> list[tuple]:
        """Execute gold SQL with read-only/SELECT-only timeout protections.

        Args:
            connection: Open SQLite connection (read-only by construction).
            sql: Gold SQL; must start with SELECT.
            timeout_s: Wall-clock budget enforced via a progress handler.

        Raises:
            ValueError: If the SQL is empty or not a SELECT.
            sqlite3.OperationalError: On timeout (re-raised with a clearer
                message) or any other SQLite operational failure.
        """
        sql_stripped = sql.strip()
        if not sql_stripped:
            raise ValueError("SQL query cannot be empty")

        # Keyword check only looks at the first word; the read-only
        # connection is the real enforcement against writes.
        first_keyword_match = _FIRST_KEYWORD_PATTERN.match(sql_stripped)
        first_keyword = (
            first_keyword_match.group(1).upper() if first_keyword_match else ""
        )
        if first_keyword != "SELECT":
            raise ValueError(f"Only SELECT queries are allowed. Got: {first_keyword}")

        deadline = time.monotonic() + timeout_s

        # Returning non-zero from a progress handler interrupts the query;
        # SQLite invokes it every ~1000 VM instructions.
        def _progress_callback() -> int:
            return 1 if time.monotonic() > deadline else 0

        connection.set_progress_handler(_progress_callback, 1000)
        try:
            cursor = connection.cursor()
            cursor.execute(sql_stripped)
            return cursor.fetchall()
        except sqlite3.OperationalError as exc:
            # An interrupted query surfaces as OperationalError("interrupted");
            # translate it into an explicit timeout message.
            if "interrupted" in str(exc).lower():
                raise sqlite3.OperationalError(
                    f"Query timed out after {timeout_s:.1f} seconds"
                ) from exc
            raise
        finally:
            # Always detach the handler so later queries are not throttled.
            connection.set_progress_handler(None, 0)
209
+
210
    def reset(
        self,
        *,
        seed: int | None = None,
        episode_id: str | None = None,
        **kwargs,
    ) -> SQLObservation:
        """Reset episode context and return the initial rich observation.

        Args:
            seed: Optional seed for deterministic question selection; when
                None, the module-level ``random`` is used.
            episode_id: Optional external episode ID; a UUID4 is generated
                otherwise.

        Raises:
            sqlite3.Error: If executing the gold SQL fails (the freshly
                opened connection is closed before re-raising).
        """
        del kwargs

        # Close the previous episode's connection before opening a new one.
        if self._episode is not None:
            self._episode.db_connection.close()

        chooser = random.Random(seed) if seed is not None else random
        question = chooser.choice(self.questions)
        connection = self._open_db(question.database_name)

        try:
            gold_rows = self._execute_gold_sql(connection, question.gold_sql)
        except sqlite3.Error:
            connection.close()
            raise

        # Materialize the gold answer for this episode from the live DB; the
        # loaded QuestionRecord is copied rather than mutated in place.
        gold_answer = self._format_gold_answer(gold_rows)
        question_for_episode = QuestionRecord(
            question_id=question.question_id,
            question_text=question.question_text,
            database_name=question.database_name,
            gold_sql=question.gold_sql,
            gold_answer=gold_answer,
            answer_type=question.answer_type,
            difficulty=question.difficulty,
            tables_involved=list(question.tables_involved),
        )

        resolved_episode_id = episode_id or str(uuid.uuid4())
        self._episode = EpisodeContext(
            episode_id=resolved_episode_id,
            db_connection=connection,
            question_record=question_for_episode,
            step_count=0,
            budget=self.step_budget,
            done=False,
            gold_answer=gold_answer,
            gold_rows=gold_rows,
        )

        # Reset the externally visible state and per-step scratch fields.
        self._state.episode_id = resolved_episode_id
        self._state.step_count = 0
        self._state.current_action_type = "QUERY"
        self._state.history_messages = []
        self._state.history_tokens = []

        self._last_result = ""
        self._last_error = ""
        self._last_reward = None
        self._last_query_truncated = False

        return self._build_observation()
269
+
270
+ def _get_table_names(self, connection: sqlite3.Connection) -> list[str]:
271
+ """Return user-visible table names for the active SQLite database."""
272
+ cursor = connection.cursor()
273
+ cursor.execute(
274
+ """
275
+ SELECT name
276
+ FROM sqlite_master
277
+ WHERE type = 'table' AND name NOT LIKE 'sqlite_%'
278
+ ORDER BY name
279
+ """
280
+ )
281
+ return [str(row[0]) for row in cursor.fetchall()]
282
+
283
+ def _resolve_table_name(self, table_name: str) -> tuple[str | None, list[str]]:
284
+ """Resolve requested table name against active DB tables."""
285
+ if self._episode is None:
286
+ return None, []
287
+ available_tables = self._get_table_names(self._episode.db_connection)
288
+ lookup = {table.lower(): table for table in available_tables}
289
+ resolved = lookup.get(table_name.strip().lower())
290
+ return resolved, available_tables
291
+
292
+ def _format_rows(self, rows: list[tuple]) -> str:
293
+ """Format SQL rows as readable text."""
294
+ if not rows:
295
+ return "No rows returned."
296
+ lines = [f"{idx}. {' | '.join(str(value) for value in row)}" for idx, row in enumerate(rows, start=1)]
297
+ return "\n".join(lines)
298
+
299
+ def _execute_sql(self, sql: str, timeout_s: float = 5.0) -> list[tuple]:
300
+ """Execute SQL in sandbox: SELECT-only, single statement, timeout, truncation."""
301
+ if self._episode is None:
302
+ raise RuntimeError("No active episode. Call reset() before step().")
303
+
304
+ sql_stripped = sql.strip()
305
+ if not sql_stripped:
306
+ raise ValueError("SQL query cannot be empty")
307
+
308
+ first_keyword_match = _FIRST_KEYWORD_PATTERN.match(sql_stripped)
309
+ first_keyword = (
310
+ first_keyword_match.group(1).upper() if first_keyword_match else ""
311
+ )
312
+ if first_keyword != "SELECT":
313
+ raise ValueError(f"Only SELECT queries are allowed. Got: {first_keyword}")
314
+
315
+ single_statement_sql = sql_stripped.rstrip(";").strip()
316
+ if ";" in single_statement_sql:
317
+ raise ValueError("Only a single SELECT statement is allowed")
318
+
319
+ deadline = time.monotonic() + timeout_s
320
+
321
+ def _progress_callback() -> int:
322
+ return 1 if time.monotonic() > deadline else 0
323
+
324
+ connection = self._episode.db_connection
325
+ connection.set_progress_handler(_progress_callback, 1000)
326
+
327
+ self._last_query_truncated = False
328
+ try:
329
+ cursor = connection.cursor()
330
+ cursor.execute(sql_stripped)
331
+ rows = cursor.fetchmany(21)
332
+ if len(rows) > 20:
333
+ self._last_query_truncated = True
334
+ rows = rows[:20]
335
+ return rows
336
+ except sqlite3.OperationalError as exc:
337
+ if "interrupted" in str(exc).lower():
338
+ raise sqlite3.OperationalError(
339
+ f"Query timed out after {timeout_s:.1f} seconds"
340
+ ) from exc
341
+ raise
342
+ finally:
343
+ connection.set_progress_handler(None, 0)
344
+
345
+ def _handle_describe(self, table_name: str) -> str:
346
+ """Return table schema and row count."""
347
+ if self._episode is None:
348
+ raise RuntimeError("No active episode. Call reset() before step().")
349
+
350
+ requested = table_name.strip()
351
+ if not requested:
352
+ raise ValueError("Argument cannot be empty for DESCRIBE")
353
+
354
+ resolved_table, available_tables = self._resolve_table_name(requested)
355
+ if resolved_table is None:
356
+ available = ", ".join(available_tables) if available_tables else "none"
357
+ raise ValueError(
358
+ f"Table '{requested}' not found. Available tables: {available}"
359
+ )
360
+
361
+ safe_identifier = resolved_table.replace('"', '""')
362
+ cursor = self._episode.db_connection.cursor()
363
+ cursor.execute(f'PRAGMA table_info("{safe_identifier}")')
364
+ columns = cursor.fetchall()
365
+ if not columns:
366
+ raise ValueError(f"Table '{resolved_table}' has no visible columns")
367
+
368
+ cursor.execute(f'SELECT COUNT(*) FROM "{safe_identifier}"')
369
+ row_count = int(cursor.fetchone()[0])
370
+ self._episode.described_tables.add(resolved_table)
371
+
372
+ lines = [f"Table '{resolved_table}' columns:"]
373
+ for _, col_name, col_type, _, _, _ in columns:
374
+ normalized_type = str(col_type).strip() or "UNKNOWN"
375
+ lines.append(f"- {col_name}: {normalized_type}")
376
+ lines.append(f"Row count: {row_count}")
377
+ return "\n".join(lines)
378
+
379
+ def _handle_sample(self, table_name: str, limit: int = 5) -> str:
380
+ """Return sample rows from a table."""
381
+ if self._episode is None:
382
+ raise RuntimeError("No active episode. Call reset() before step().")
383
+
384
+ requested = table_name.strip()
385
+ if not requested:
386
+ raise ValueError("Argument cannot be empty for SAMPLE")
387
+
388
+ resolved_table, available_tables = self._resolve_table_name(requested)
389
+ if resolved_table is None:
390
+ available = ", ".join(available_tables) if available_tables else "none"
391
+ raise ValueError(
392
+ f"Table '{requested}' not found. Available tables: {available}"
393
+ )
394
+
395
+ safe_identifier = resolved_table.replace('"', '""')
396
+ bounded_limit = max(1, min(limit, 20))
397
+ rows = self._execute_sql(
398
+ f'SELECT * FROM "{safe_identifier}" LIMIT {bounded_limit}'
399
+ )
400
+ return f"Sample from '{resolved_table}':\n{self._format_rows(rows)}"
401
+
402
+ def _handle_query(self, sql: str) -> tuple[str, list[tuple]]:
403
+ """Execute query and return formatted output with raw result rows."""
404
+ sql_text = sql.strip()
405
+ if not sql_text:
406
+ raise ValueError("Argument cannot be empty for QUERY")
407
+
408
+ rows = self._execute_sql(sql_text, timeout_s=5.0)
409
+ output = self._format_rows(rows)
410
+ if self._last_query_truncated:
411
+ output = f"{output}\n... (truncated to 20 rows)"
412
+ return output, rows
413
+
414
+ def _handle_answer(self, value: str) -> tuple[bool, float]:
415
+ """Compare submitted answer against episode gold answer."""
416
+ if self._episode is None:
417
+ raise RuntimeError("No active episode. Call reset() before step().")
418
+
419
+ is_correct = verify_answer(
420
+ predicted=value,
421
+ gold=self._episode.gold_answer or "",
422
+ answer_type=self._episode.question_record.answer_type,
423
+ gold_rows=self._episode.gold_rows,
424
+ )
425
+ self._episode.done = True
426
+ return is_correct, 1.0 if is_correct else 0.0
427
+
428
    def step(
        self,
        action: SQLAction,
        *,
        timeout_s: float = 30,
        **kwargs,
    ) -> SQLObservation:
        """Dispatch one structured action and return updated observation.

        Supported action types: DESCRIBE, SAMPLE, QUERY, ANSWER. Every
        non-ANSWER action (valid or not) consumes one unit of budget;
        the episode ends when the budget reaches zero or an ANSWER is
        submitted.

        Args:
            action: Structured action with ``action_type`` and ``argument``.
            timeout_s: Ignored; per-query timeouts are enforced by
                ``_execute_sql`` internally.
            **kwargs: Ignored; accepted for interface compatibility.

        Returns:
            The observation reflecting the environment after the action.
        """
        del timeout_s
        del kwargs

        # Stepping without reset() yields an error observation, not an exception.
        if self._episode is None:
            self._last_result = ""
            self._last_error = "No active episode. Call reset() before step()."
            self._last_reward = None
            return self._build_observation()

        # Finished episodes are inert: further steps just re-emit the state.
        if self._episode.done:
            return self._build_observation()

        action_type = str(action.action_type).strip().upper()
        argument = str(action.argument)

        self._state.current_action_type = action_type or "QUERY"
        self._last_result = ""
        self._last_error = ""
        self._last_reward = None
        reward_rows: list[tuple] | None = []
        reward_sql = ""

        def _consume_invalid_step(error_text: str) -> SQLObservation:
            # Invalid actions still burn budget so agents cannot stall forever.
            self._last_error = error_text
            self._episode.step_count += 1
            self._episode.budget = max(0, self._episode.budget - 1)
            self._episode.action_log.append(f"{action_type} -> ERROR: {error_text}")
            if self._episode.budget == 0:
                self._episode.done = True
            self._last_reward = 0.0
            self._state.step_count = self._episode.step_count
            return self._build_observation()

        valid_action_types = {"DESCRIBE", "SAMPLE", "QUERY", "ANSWER"}
        if action_type not in valid_action_types:
            return _consume_invalid_step(
                f"Unknown action type '{action.action_type}'. "
                "Valid types: DESCRIBE, SAMPLE, QUERY, ANSWER"
            )

        argument_stripped = argument.strip()
        if not argument_stripped:
            return _consume_invalid_step(
                f"Argument cannot be empty for {action_type}"
            )

        try:
            if action_type == "DESCRIBE":
                self._last_result = self._handle_describe(argument_stripped)
            elif action_type == "SAMPLE":
                self._last_result = self._handle_sample(argument_stripped)
            elif action_type == "QUERY":
                reward_sql = argument_stripped
                self._last_result, reward_rows = self._handle_query(argument_stripped)
            else:
                # ANSWER terminates the episode immediately and returns
                # early, skipping the budget decrement and shaped-reward
                # logic below.
                is_correct, reward = self._handle_answer(argument_stripped)
                verdict = "correct" if is_correct else "incorrect"
                self._last_result = f"Answer submitted: {verdict}."
                self._last_reward = reward
                self._episode.step_count += 1
                self._episode.action_log.append(
                    f"ANSWER {argument_stripped} -> {verdict}"
                )
                self._state.step_count = self._episode.step_count
                return self._build_observation()

        except ValueError as exc:
            self._last_error = str(exc)
        except sqlite3.Error as exc:
            self._last_error = f"SQL error: {exc}"

        self._episode.step_count += 1
        self._episode.budget = max(0, self._episode.budget - 1)
        self._state.step_count = self._episode.step_count

        # Shaped per-step reward only while budget remains; exhaustion is
        # handled below with a fallback reward of 0.0.
        if self._episode.budget > 0:
            self._last_reward = compute_step_reward(
                ctx=self._episode,
                action_type=action_type,
                sql=reward_sql,
                rows=reward_rows,
                error=self._last_error or None,
            )

        if self._last_error:
            self._episode.action_log.append(f"{action_type} -> ERROR: {self._last_error}")
        else:
            # Log only the first line of multi-line results to keep the
            # action history compact.
            preview = self._last_result.splitlines()[0] if self._last_result else "ok"
            self._episode.action_log.append(f"{action_type} -> {preview}")

        if self._episode.budget == 0:
            self._episode.done = True
        if self._last_reward is None:
            self._last_reward = 0.0

        return self._build_observation()
532
+
533
+ def _build_observation(self) -> SQLObservation:
534
+ """Construct a rich observation from the current episode context."""
535
+ if self._episode is None:
536
+ observation = SQLObservation(
537
+ question="",
538
+ schema_info="",
539
+ result=self._last_result,
540
+ error=self._last_error,
541
+ step_count=0,
542
+ budget_remaining=0,
543
+ action_history=[],
544
+ done=False,
545
+ reward=self._last_reward,
546
+ )
547
+ else:
548
+ table_names = self._get_table_names(self._episode.db_connection)
549
+ known_tables = set(table_names)
550
+ schema_lines = ["Available tables:", *[f"- {name}" for name in table_names]]
551
+
552
+ if self._episode.described_tables:
553
+ schema_lines.append("")
554
+ schema_lines.append("Described tables:")
555
+ for table_name in sorted(self._episode.described_tables):
556
+ if table_name not in known_tables:
557
+ schema_lines.append(
558
+ f"- {table_name}: unavailable (not in active schema)"
559
+ )
560
+ continue
561
+ safe_identifier = table_name.replace('"', '""')
562
+ cursor = self._episode.db_connection.cursor()
563
+ cursor.execute(f'PRAGMA table_info("{safe_identifier}")')
564
+ columns = cursor.fetchall()
565
+ if not columns:
566
+ schema_lines.append(f"- {table_name}: no columns available")
567
+ continue
568
+ column_summary = ", ".join(
569
+ f"{str(column[1])} {str(column[2]) or 'UNKNOWN'}"
570
+ for column in columns
571
+ )
572
+ schema_lines.append(f"- {table_name}: {column_summary}")
573
+
574
+ observation = SQLObservation(
575
+ question=self._episode.question_record.question_text,
576
+ schema_info="\n".join(schema_lines),
577
+ result=self._last_result,
578
+ error=self._last_error,
579
+ step_count=self._episode.step_count,
580
+ budget_remaining=self._episode.budget,
581
+ action_history=list(self._episode.action_log),
582
+ done=self._episode.done,
583
+ reward=self._last_reward,
584
+ )
585
+
586
+ transformed = self._apply_transform(observation)
587
+ if isinstance(transformed, SQLObservation):
588
+ return transformed
589
+
590
+ return SQLObservation(
591
+ question=getattr(transformed, "question", ""),
592
+ schema_info=getattr(transformed, "schema_info", ""),
593
+ result=getattr(transformed, "result", ""),
594
+ error=getattr(transformed, "error", ""),
595
+ step_count=getattr(transformed, "step_count", 0),
596
+ budget_remaining=getattr(transformed, "budget_remaining", 0),
597
+ action_history=getattr(transformed, "action_history", []),
598
+ done=transformed.done,
599
+ reward=transformed.reward,
600
+ )
601
+
602
    @property
    def state(self) -> SQLState:
        """Exposed state metadata (episode id, step count, action/history fields)."""
        return self._state
606
+
607
+ def message_to_action(self, message: Message) -> SQLAction:
608
+ """Convert free-form messages into structured SQLAction values."""
609
+ if "role" not in message:
610
+ raise ValueError("Message must contain a 'role' key")
611
+ if "content" not in message:
612
+ raise ValueError("Message must contain a 'content' key")
613
+ if message["content"] is None:
614
+ raise ValueError("Message content cannot be None")
615
+
616
+ content = str(message["content"])
617
+ parsed = content.strip()
618
+
619
+ action_type = "QUERY"
620
+ argument = content
621
+
622
+ if message["role"].lower() == "user" and parsed:
623
+ prefix, separator, remainder = parsed.partition(" ")
624
+ normalized_prefix = prefix.upper()
625
+ if normalized_prefix in {"DESCRIBE", "SAMPLE", "QUERY", "ANSWER"}:
626
+ action_type = normalized_prefix
627
+ if separator:
628
+ argument = remainder
629
+ else:
630
+ argument = ""
631
+
632
+ self._state.current_action_type = action_type
633
+ self._state.history_messages.append(message)
634
+
635
+ return SQLAction(action_type=action_type, argument=argument)
server/synthetic/__init__.py ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
"""Synthetic database generation utilities for metamorphic testing."""

from .generate import VariantResult, generate_variant, generate_variants_for_question
from .mutations import (
    MutationResult,
    TableSchema,
    detect_bridge_tables,
    duplicate_bridge_rows,
    get_table_schemas,
    inject_irrelevant_rows,
    remap_ids,
)

# Public API of the ``server.synthetic`` package, kept alphabetically sorted.
__all__ = [
    "MutationResult",
    "TableSchema",
    "VariantResult",
    "detect_bridge_tables",
    "duplicate_bridge_rows",
    "generate_variant",
    "generate_variants_for_question",
    "get_table_schemas",
    "inject_irrelevant_rows",
    "remap_ids",
]