sql_env / specs /F001-IMPLEMENTATION_SPEC.md
hjerpe's picture
Upload folder using huggingface_hub
5dd1bb4 verified

Implementation Specification

Change: F001 - Core Environment Loop (step/reset lifecycle with structured actions, SQLite execution, sandboxing, question loading, step budget) Date: 2026-03-24 Research Summary: specs/F001-RESEARCH_SUMMARY.md Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner) Behavior Delta: Archived in specs/behavior/sql-environment.md

Plan Status:

  • Draft
  • Approved for Implementation
  • Implementation Complete
  • Verification Passed

Core Intent (Immutable)

DO NOT MODIFY THIS SECTION DURING REFINEMENT Changes to Core Intent mean you're describing a different feature. If refinement reveals the need to change this section, create a new feature instead.

User Problem: Agents can play complete episodes: reset with a random question, explore a hidden schema via DESCRIBE/SAMPLE, run SQL queries, and submit answers. Currently SQL never executes -- this makes the environment actually functional.

Success Criteria:

  • Agent sends DESCRIBE employees and immediately sees column names and types
  • Queries execute in <100ms with clean truncated output (max 20 rows)
  • Bad SQL returns a clear error message the agent can learn from
  • Episode ends cleanly when budget exhausted or ANSWER submitted

Avoid:

  • Environment calling Ollama to interpret actions -- agent should own reasoning, env should just execute
  • Queries hanging or crashing the environment
  • Opaque error messages that don't help the agent adjust

Out of Scope:

  • Advanced reward computation (Phase 3 -- server/reward.py stub)
  • Answer verification beyond simple string comparison (Phase 2 -- server/verifier.py stub)
  • Synthetic data generation for databases
  • Multi-database episode support (single db per episode)
  • Token/message history management (existing OpenEnv pattern, not touched)

0. Slicing & Scope Budget (Anti-Waterfall)

This spec must be executable in small, mergeable increments.

Scope Budget

  • Target: 3 slices
  • Hard max: <= 10 steps total
  • Each step must end in: implement -> verify -> merge

Slice Definition

A slice is a vertical increment that delivers user-visible value or a safe internal capability.

Each slice must have:

  • Clear outcome
  • Minimal interface change
  • Merge criteria

Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).

Status Icons

Step Status:

  • !! Not Started
  • :: In Progress
  • OK Completed
  • XX Blocked/Failed

Result Outcome:

  • OK Fully Successful (all tests passed, no issues)
  • ?? Completed with Issues (needs follow-up)
  • XX Failed/Blocked

1. Implementation Overview

Summary

Replace the non-functional Ollama-based step/reset lifecycle with a working environment loop. Download Spider SQLite databases for real SQL execution. Rewrite models.py to use structured SQLAction (with argument field replacing action_description) and rich SQLObservation (with question, schema_info, result, error, step_count, budget_remaining, action_history). Implement EpisodeContext and QuestionRecord as server-side dataclasses. Wire reset() to pick a random question, open a read-only SQLite connection, compute the gold answer, and return an initial observation. Wire step() to dispatch structured actions to _handle_describe, _handle_sample, _handle_query, and _handle_answer handlers. Implement sandboxed SQL execution (_execute_sql) with SELECT-only validation, read-only connection, 5s timeout, and 20-row truncation. Enforce a 15-step budget. Update server/app.py factory and client.py to match the new interfaces. Remove Ollama dependency entirely.

Scope

In Scope:

  • Download Spider SQLite databases via script
  • QuestionRecord and EpisodeContext dataclasses in models.py
  • Rewrite SQLAction with argument field (replacing action_description)
  • Uncomment and populate rich SQLObservation fields
  • SQLEnvironment.__init__ with questions_path, db_dir, step_budget params
  • reset() with question selection, DB connection, gold answer computation
  • step() dispatching to four action handlers
  • _execute_sql() with sandboxing (read-only, SELECT-only, timeout, truncation)
  • _handle_describe(), _handle_sample(), _handle_query(), _handle_answer()
  • _build_observation() constructing rich observations
  • _load_questions() and _open_db() infrastructure
  • Update server/app.py factory function
  • Update client.py for new observation fields
  • Remove _call_ollama_to_select_table(), _call_ollama_for_sql(), _detect_action_type()
  • Refactor or remove message_to_action() (thin adapter if required by OpenEnv)

Out of Scope:

  • server/reward.py implementation (Phase 3)
  • server/verifier.py implementation beyond simple string comparison (Phase 2)
  • WebSocket-specific changes (OpenEnv handles this via create_app)
  • Token history management changes

1a. Execution Status

Progress: 8/8 steps complete Current Step: Finalization complete (verification passed) Last Updated: 2026-03-24T21:27:31Z Latest Result: OK Fully Successful (Step 3.2 completed; rewritten smoke suite validates structured action loop and all tests are green) Blockers: None


1b. Risk Assessment

Risk Tier: Medium

Risk Tier Definitions:

  • Low: Pure logic, non-user-facing, no security implications
  • Medium: User input handling, data validation, API changes
  • High: Authentication, payments, secrets management, untrusted input

High-Risk Indicators Present: (check all that apply if tier is High)

  • Touches authentication or authorization logic
  • Handles payment processing or financial data
  • Manages secrets, API keys, or credentials
  • Processes untrusted user input (file uploads, external APIs)
  • Modifies privilege/permission systems

Security Review Required: No

Justification: Agent-provided SQL is untrusted input, but mitigated by read-only SQLite connections, SELECT-only validation, and query timeout. No authentication, secrets, or payment logic involved. The SQL injection surface is intentionally constrained to read-only SELECT queries on a local SQLite file.


2. Change Manifest

Files to Create

File Purpose
scripts/download_spider_databases.py Script to download Spider SQLite database files from the Spider dataset

Files to Modify

File Changes
models.py Rewrite SQLAction (add argument, remove action_description). Uncomment rich SQLObservation fields. Add EpisodeContext, QuestionRecord dataclasses. Update SQLState.
server/sql_environment.py Complete rewrite of __init__, reset(), step(). Add _execute_sql, _handle_describe, _handle_sample, _handle_query, _handle_answer, _build_observation, _load_questions, _open_db. Remove Ollama methods. Refactor message_to_action.
server/app.py Update create_sql_environment() factory to pass questions_path and db_dir
client.py Update _parse_result() to handle rich SQLObservation fields
tests/test_smoke.py Rewrite tests for new structured action interface and SQL execution

Files to Delete

File Reason
(none) No files deleted; Ollama methods removed from sql_environment.py inline

3. Interface Specifications

New Types

# Location: models.py

from dataclasses import dataclass, field
import sqlite3

@dataclass
class QuestionRecord:
    """One question from the Spider dataset."""
    question_id: str
    question_text: str
    database_name: str
    gold_sql: str
    gold_answer: str                # Computed at load or reset by running gold_sql
    answer_type: str                # "integer" | "float" | "string" | "list"
    difficulty: str                 # "easy" | "medium" | "hard"
    tables_involved: list[str]


@dataclass
class EpisodeContext:
    """Per-episode server-side state (never sent to agent)."""
    episode_id: str
    db_connection: sqlite3.Connection
    question_record: QuestionRecord
    step_count: int = 0
    budget: int = 15
    described_tables: set[str] = field(default_factory=set)
    action_log: list[str] = field(default_factory=list)
    done: bool = False
    gold_answer: str | None = None  # Computed at reset by running gold_sql

Modified Types

# Location: models.py
# CHANGE: Replace action_description with argument; add ANSWER action type

class SQLAction(Action):
    """Structured action from agent to environment."""
    action_type: str = Field(
        ..., description="One of: DESCRIBE, SAMPLE, QUERY, ANSWER"
    )
    argument: str = Field(
        ..., description="Table name (DESCRIBE/SAMPLE), SQL string (QUERY), or answer value (ANSWER)"
    )
    # REMOVED: action_description, tokens
# Location: models.py
# CHANGE: Uncomment rich observation fields, remove messages/tokens

class SQLObservation(Observation):
    """Rich observation from environment to agent."""
    # Inherited: done (bool), reward (float | None)
    question: str = Field(..., description="The NL question to answer")
    schema_info: str = Field(..., description="Known schema info (table names initially)")
    result: str = Field(default="", description="Result of last action (truncated)")
    error: str = Field(default="", description="Error message if action failed")
    step_count: int = Field(default=0, description="Current step number")
    budget_remaining: int = Field(default=0, description="Steps left")
    action_history: list[str] = Field(
        default_factory=list, description="Summary of previous actions"
    )

New Functions

# Location: server/sql_environment.py

class SQLEnvironment(Environment[SQLAction, SQLObservation, SQLState]):

    def __init__(
        self,
        questions_path: str,
        db_dir: str,
        tokenizer: ModelTokenizer,
        step_budget: int = 15,
    ):
        """Initialize with path to questions JSON and database directory.

        Args:
            questions_path: Path to Spider questions JSON file
            db_dir: Directory containing Spider SQLite database files
            tokenizer: ModelTokenizer for OpenEnv compatibility
            step_budget: Maximum steps per episode (default 15)
        """

    def _load_questions(self, path: str) -> list[QuestionRecord]:
        """Load and parse question JSON into QuestionRecord list.

        Args:
            path: Path to questions JSON file (Spider format)

        Returns:
            List of QuestionRecord objects

        Raises:
            FileNotFoundError: If questions file does not exist
            ValueError: If JSON format is invalid
        """

    def _open_db(self, db_name: str) -> sqlite3.Connection:
        """Open read-only SQLite connection for a Spider database.

        Args:
            db_name: Database name (matches db_id in questions JSON)

        Returns:
            Read-only sqlite3.Connection

        Raises:
            FileNotFoundError: If database file does not exist
        """

    def _execute_sql(self, sql: str, timeout_s: float = 5.0) -> list[tuple]:
        """Sandboxed SQL execution: read-only, timeout, SELECT-only.

        Args:
            sql: SQL query to execute
            timeout_s: Maximum execution time in seconds

        Returns:
            List of result tuples

        Raises:
            ValueError: If SQL is not a SELECT statement
            sqlite3.OperationalError: If query fails or times out
        """

    def _handle_describe(self, table_name: str) -> str:
        """Return column names, types, row count for table.

        Args:
            table_name: Name of the table to describe

        Returns:
            Formatted string with column info, or error message if table not found
        """

    def _handle_sample(self, table_name: str, limit: int = 5) -> str:
        """Execute SELECT * FROM table LIMIT N, return formatted rows.

        Args:
            table_name: Name of the table to sample
            limit: Maximum rows to return (default 5)

        Returns:
            Formatted string with sample data, or error message if table not found
        """

    def _handle_query(self, sql: str) -> str:
        """Validate SELECT-only, execute with timeout, truncate to 20 rows.

        Args:
            sql: SQL SELECT query to execute

        Returns:
            Formatted result string, or error message
        """

    def _handle_answer(self, value: str) -> tuple[bool, float]:
        """Compare to gold answer, return (correct, reward).

        Args:
            value: Agent's answer string

        Returns:
            Tuple of (is_correct, reward_value)
        """

    def _build_observation(self) -> SQLObservation:
        """Construct SQLObservation from current episode context.

        Returns:
            Rich SQLObservation with question, schema, result, error, budget info
        """

Modified Functions

# Location: server/sql_environment.py
# CHANGE: New constructor signature with questions_path, db_dir, step_budget

def __init__(
    self,
    questions_path: str,      # NEW
    db_dir: str,               # NEW
    tokenizer: ModelTokenizer,
    step_budget: int = 15,     # NEW
):
    """Initialize with question dataset and database paths."""
# Location: server/sql_environment.py
# CHANGE: reset() now picks question, opens DB, computes gold answer

def reset(
    self,
    *,
    seed: int | None = None,
    episode_id: str | None = None,
    **kwargs,
) -> SQLObservation:
    """Pick random question, open read-only SQLite, return initial observation."""
# Location: server/sql_environment.py
# CHANGE: step() now dispatches structured actions, executes SQL

def step(
    self,
    action: SQLAction,
    *,
    timeout_s: float = 30,
    **kwargs,
) -> SQLObservation:
    """Dispatch to handler, update episode context, return observation."""
# Location: server/app.py
# CHANGE: Factory passes questions_path and db_dir

def create_sql_environment():
    """Factory function that creates SQLEnvironment with tokenizer and data paths."""
    tokenizer = get_tokenizer()
    questions_path = os.environ.get(
        "QUESTIONS_PATH",
        str(Path(__file__).parent.parent / "data" / "questions" / "student_assessment.json"),
    )
    db_dir = os.environ.get(
        "DB_DIR",
        str(Path(__file__).parent.parent / "data" / "databases"),
    )
    return SQLEnvironment(
        questions_path=questions_path,
        db_dir=db_dir,
        tokenizer=tokenizer,
    )

API Changes

The HTTP/WebSocket API is defined by OpenEnv's create_app() and does not change structurally. The payload shapes change:

# Endpoint: POST /step
# CHANGE: SQLAction now uses argument instead of action_description

Request:
  action_type: str  # "DESCRIBE" | "SAMPLE" | "QUERY" | "ANSWER"
  argument: str     # table name, SQL, or answer value

Response (SQLObservation):
  done: bool
  reward: float | null
  question: str
  schema_info: str
  result: str
  error: str
  step_count: int
  budget_remaining: int
  action_history: list[str]
# Endpoint: POST /reset
# CHANGE: Now returns rich observation with question and schema

Response (SQLObservation):
  done: false
  reward: null
  question: str           # The NL question for this episode
  schema_info: str        # Table names only (columns hidden until DESCRIBE)
  result: ""
  error: ""
  step_count: 0
  budget_remaining: 15
  action_history: []

4. Data Flow

Primary Flow: Reset

1. Client calls POST /reset
   - Input: optional seed, episode_id

2. SQLEnvironment.reset()
   - Close previous EpisodeContext.db_connection (if exists)
   - Pick random QuestionRecord from loaded questions (using seed if provided)
   - Open read-only SQLite via _open_db(question.database_name)
   - Execute gold_sql to compute gold_answer
   - Create new EpisodeContext (step_count=0, budget=15, done=False)

3. _build_observation()
   - Output: SQLObservation with question text, table names as schema_info,
     empty result/error, step_count=0, budget_remaining=15, empty action_history

Primary Flow: Step (QUERY)

1. Client calls POST /step with SQLAction(action_type="QUERY", argument="SELECT ...")
   - Input: structured action

2. SQLEnvironment.step(action)
   - Validate action_type is one of DESCRIBE/SAMPLE/QUERY/ANSWER
   - Check episode not done and budget > 0
   - Dispatch to _handle_query(sql)

3. _handle_query(sql)
   - Validate SQL starts with SELECT (case-insensitive, after stripping)
   - Call _execute_sql(sql, timeout_s=5.0)
   - Format results as text table, truncate to 20 rows
   - Return formatted result string

4. Update EpisodeContext
   - step_count += 1
   - budget -= 1
   - Append action summary to action_log
   - If budget == 0: done = True

5. _build_observation()
   - Output: SQLObservation with result, updated step_count/budget

Alternative Flows

When action_type is DESCRIBE:

1. _handle_describe(table_name)
2. If table_name not in database tables -> return error string listing available tables
3. Query sqlite_master or PRAGMA table_info for column names/types
4. Add table to described_tables set
5. Return formatted schema string

When action_type is SAMPLE:

1. _handle_sample(table_name, limit=5)
2. If table_name not in database tables -> return error string
3. Execute "SELECT * FROM {table_name} LIMIT 5" via _execute_sql
4. Return formatted rows

When action_type is ANSWER:

1. _handle_answer(value)
2. Compare value to gold_answer (case-insensitive string comparison for MVP)
3. Set done = True
4. Return (is_correct, 1.0 if correct else 0.0)
5. Do NOT decrement budget for ANSWER actions

When budget is exhausted:

1. Budget reaches 0 after step
2. Set done = True, reward = 0.0
3. Return terminal observation with done=True

When SQL is invalid:

1. _handle_query receives non-SELECT SQL
2. Return error: "Only SELECT queries are allowed. Got: {first_word}"
3. Step still counts against budget

When SQL times out:

1. _execute_sql exceeds 5s timeout
2. Interrupt query via progress_handler
3. Return error: "Query timed out after 5.0 seconds"

When step() called after episode is done:

1. Check self._episode.done is True
2. Return current observation unchanged (no state mutation)

5. Error Handling

Error Types

Error When Response User Message
Invalid action_type action_type not in {DESCRIBE, SAMPLE, QUERY, ANSWER} error field in observation "Unknown action type '{x}'. Valid types: DESCRIBE, SAMPLE, QUERY, ANSWER"
Table not found DESCRIBE/SAMPLE with nonexistent table error field in observation "Table '{x}' not found. Available tables: {list}"
Non-SELECT SQL QUERY with INSERT/UPDATE/DELETE/etc. error field in observation "Only SELECT queries are allowed. Got: {first_keyword}"
SQL syntax error Invalid SQL error field in observation "SQL error: {sqlite3_error_message}"
Query timeout Execution exceeds 5s error field in observation "Query timed out after 5.0 seconds"
Empty argument Blank argument field error field in observation "Argument cannot be empty for {action_type}"
Episode already done step() after termination Return current obs (no error -- observation unchanged, done=True)
Database file missing _open_db can't find .sqlite FileNotFoundError at reset "Database '{db_name}' not found in {db_dir}"
Questions file missing _load_questions can't find JSON FileNotFoundError at init "Questions file not found: {path}"

Error Handling Strategy

All action-level errors are returned in the error field of SQLObservation. The environment never raises exceptions from step() -- errors are part of the observation so the agent can learn from them.

Infrastructure errors (missing database files, missing questions file) raise Python exceptions at init/reset time since these are configuration failures, not agent errors.

# Pattern for action handlers:
def _handle_query(self, sql: str) -> str:
    sql_stripped = sql.strip()
    if not sql_stripped:
        return ""  # error set in step()

    # SELECT-only check
    first_word = sql_stripped.split()[0].upper()
    if first_word != "SELECT":
        return ""  # error set in step()

    try:
        rows = self._execute_sql(sql_stripped)
        return self._format_results(rows)
    except sqlite3.OperationalError as e:
        return ""  # error message set in step()

Retry Strategy

Operation Retry? Strategy
SQL execution No Single attempt; timeout kills long queries
DB connection open No Fail fast at reset(); configuration error
Question loading No Fail fast at init; file must exist

6. Slice Plan (What we will ship, in order)

Slice S1 -- Data & Types Foundation

Value: Database files exist, models are updated, environment can be instantiated with new constructor User-visible change: No (internal foundation) Interfaces introduced/changed: SQLAction.argument, rich SQLObservation, EpisodeContext, QuestionRecord, new __init__ signature Rollback safety: Additive -- new fields on models, old code paths not yet removed

Slice S2 -- Core Environment Loop

Value: reset() picks questions and opens databases; step() dispatches to handlers and executes real SQL; episodes run end-to-end User-visible change: Yes -- the environment is now functional Interfaces introduced/changed: reset(), step(), all _handle_* methods, _execute_sql, _build_observation Rollback safety: Replaces existing broken Ollama-based methods; rollback = revert commit

Slice S3 -- Integration & Cleanup

Value: Factory, client, and tests updated; Ollama code removed; environment fully wired User-visible change: Yes -- complete end-to-end agent interaction works Interfaces introduced/changed: create_sql_environment() factory, client _parse_result() Rollback safety: Final cleanup slice; rollback = revert commit


7. Implementation Steps

VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md. The verification-planner (separate agent) generated independent test criteria. Run the tests specified there after implementing each step.

Step 1.1: Download Spider SQLite Databases

Slice: S1 Goal: Create a script that downloads the actual Spider SQLite database files so the environment has real data to query.

Files:

  • scripts/download_spider_databases.py - create - Download script that fetches Spider database .sqlite files
  • data/databases/ - modified - Will contain downloaded .sqlite files (gitignored)

Interface Changes: None (infrastructure only)

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-24T19:22:08Z Changes Made:

  • scripts/download_spider_databases.py created as a CLI utility to download one Spider SQLite database (--db-id) or all databases (--db-id all) into data/databases/
  • Added argument parsing (--db-id, --output-dir, --force) and reusable download helpers for raw single-file and archive-based bulk download
  • Added input/path hardening: db_id validation ([A-Za-z0-9_]+) and safe output-path boundary enforcement to prevent path traversal writes

Result:

  • Outcome: OK Fully Successful
  • Evidence Captured:
    Command: uv run python scripts/download_spider_databases.py --help
    Result: CLI usage printed successfully with expected options
    
    Command: uv run python scripts/download_spider_databases.py --db-id "../bad"
    Result: ValueError raised as expected for invalid db_id
    
    Command: uv run pytest tests/ -v
    Result: 21 passed in 4.73s
    
    Reviewer subagent verdict: APPROVE
    
  • Tests run: uv run pytest tests/ -v
  • Notes:
    • Script should download the student_assessment database at minimum
    • Spider databases are typically at https://github.com/taoyds/spider or HuggingFace
    • The student_assessment.sqlite must match the ORM models in data/databases/models.py
  • Issues: Legacy environment/client/test code still targets removed wire fields (action_description, messages, tokens); resolved by planned S2/S3 steps.
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Database file(s) must exist at data/databases/student_assessment/student_assessment.sqlite before reset() can work

Step 1.2: Add QuestionRecord and EpisodeContext to models.py

Slice: S1 Goal: Implement the server-side dataclasses that hold per-episode state and question metadata.

Files:

  • models.py - modify - Add QuestionRecord and EpisodeContext dataclasses

Interface Changes:

  • New QuestionRecord dataclass
  • New EpisodeContext dataclass

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-24T19:26:22Z Changes Made:

  • models.py updated to add QuestionRecord dataclass with the full 8-field question metadata contract.
  • models.py updated to add EpisodeContext dataclass with server-side episode state, including safe mutable defaults for described_tables and action_log.
  • Added dataclass/sqlite imports and aliased dataclass field to dataclass_field to avoid conflicts with Pydantic Field.

Result:

  • Outcome: OK Fully Successful
  • Evidence Captured:
    Command: uv run pytest tests/ -v
    Result: 21 passed in 4.70s
    
    Reviewer subagent verdict: APPROVE
    
  • Tests run: uv run pytest tests/ -v
  • Notes:
    • Keep conceptual comments in models.py for reference but implement the actual dataclasses
    • EpisodeContext.db_connection is sqlite3.Connection -- not serializable, server-only
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • QuestionRecord and EpisodeContext now exist as concrete server-side types; proceed to wire-level model rewrite in Step 1.3 (SQLAction.argument and rich SQLObservation fields).

Step 1.3: Rewrite SQLAction and SQLObservation

Slice: S1 Goal: Update wire types to use structured argument field and rich observation fields.

Files:

  • models.py - modify - Rewrite SQLAction (replace action_description with argument, remove tokens), uncomment and update SQLObservation rich fields, remove messages/tokens

Interface Changes:

  • SQLAction.action_description -> SQLAction.argument
  • SQLAction.tokens removed
  • SQLObservation gains: question, schema_info, result, error, step_count, budget_remaining, action_history
  • SQLObservation.messages and SQLObservation.tokens removed

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Medium

Breaking API change -- client must be updated in S3

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-24T19:32:08Z Changes Made:

  • models.py updated to replace SQLAction.action_description with SQLAction.argument, and remove SQLAction.tokens from the wire contract.
  • models.py updated to replace legacy SQLObservation.messages/tokens payload shape with rich observation fields: question, schema_info, result, error, step_count, budget_remaining, action_history.
  • models.py updated SQLState.current_action_type default/description to align with normalized action vocabulary (DESCRIBE, SAMPLE, QUERY, ANSWER).

Result:

  • Outcome: ?? Completed with Issues
  • Evidence Captured:
    Command: uv run pytest tests/ -v
    Result: 15 failed, 6 passed in 5.30s
    
    Failure pattern: expected legacy contract mismatch in tests/environment/client still using
    action_description/messages/tokens. This is expected after Step 1.3 wire-model rewrite and
    will be resolved by the planned S2/S3 environment/client/test rewrites.
    
    Reviewer subagent verdict: APPROVE
    
  • Tests run: uv run pytest tests/ -v
  • Notes:
    • This is a breaking change to the wire protocol
    • Existing tests fail after this step until Step 2.x/3.x updates environment/client/tests to the new contract
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Wire contracts are now in place; next step is Step 2.1 to rewrite environment constructor and data loading/open-db infrastructure to match the new model interfaces.

Step 2.1: Rewrite SQLEnvironment constructor, _load_questions, _open_db

Slice: S2 Goal: New constructor that accepts questions_path and db_dir, loads questions at init, and provides _open_db for reset.

Files:

  • server/sql_environment.py - modify - Rewrite __init__, add _load_questions(), add _open_db()

Interface Changes:

  • SQLEnvironment.__init__(questions_path, db_dir, tokenizer, step_budget) replaces old constructor
  • New _load_questions(path) -> list[QuestionRecord]
  • New _open_db(db_name) -> sqlite3.Connection

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-24T19:44:22Z Changes Made:

  • server/sql_environment.py constructor rewritten to require questions_path, db_dir, tokenizer, and step_budget, with validation for missing paths and non-positive step budgets.
  • Added _load_questions(path) to parse Spider question JSON into QuestionRecord values with schema-safe db_id validation and derived tables_involved from FROM/JOIN clauses.
  • Added _open_db(db_name) read-only opener using file:{path}?mode=ro, with db-name allowlist validation and resolved-path containment checks to prevent path traversal outside db_dir.
  • Removed runtime dependency on Ollama HTTP calls in helper methods to keep this step local and deterministic while Step 2.3 rewires query execution fully.

Result:

  • Outcome: ?? Completed with Issues
  • Evidence Captured:
    Command: uv run ruff check server/sql_environment.py
    Result: All checks passed
    
    Command: uv run pytest tests/ -v
    Result: 21 failed in 4.97s
    Failure pattern: expected legacy smoke suite mismatch (tests still assert old
    constructor and wire contract: system_prompt/action_description/messages/tokens).
    
    Reviewer subagent verdict: APPROVE
    Reviewer notes: previously reported _open_db path-traversal risk resolved via
    db_name allowlist + resolved path containment checks.
    
  • Tests run: uv run pytest tests/ -v
  • Notes:
    • Remove Ollama config (ollama_model, ollama_base_url)
    • Remove requests import
    • Keep self.db_models dict for _handle_describe fallback but prefer PRAGMA table_info on the live SQLite connection
    • _open_db opens with URI file:{path}?mode=ro
  • Issues: Legacy smoke tests still target pre-S2 interfaces and will be rewritten in Step 3.2.
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Constructor, question loading, and DB opening are ready with path/input guards; proceed to Step 2.2 to implement reset lifecycle and rich observation building.

Step 2.2: Implement reset() and _build_observation()

Slice: S2 Goal: reset() picks a random question, opens the database, computes the gold answer, creates EpisodeContext, and returns the initial observation via _build_observation().

Files:

  • server/sql_environment.py - modify - Rewrite reset(), add _build_observation()

Interface Changes:

  • reset(*, seed, episode_id, **kwargs) -> SQLObservation
  • _build_observation() -> SQLObservation

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-24T19:54:13Z Changes Made:

  • server/sql_environment.py reset lifecycle rewritten to select a question deterministically with optional seed, close any previous episode connection, open a read-only SQLite DB, compute the question gold answer, and initialize EpisodeContext with configured budget and optional episode_id.
  • Added _build_observation() to construct rich SQLObservation payloads from live episode context, including question text, schema table listing, budget/step counters, action history, and progressive described-table column info.
  • Added reset support helpers _get_table_names(), _format_gold_answer(), and _execute_gold_sql() (SELECT-only + timeout guarded) plus a temporary _create_observation() wrapper for compatibility until Step 2.3 rewrites step().

Result:

  • Outcome: ?? Completed with Issues
  • Evidence Captured:
    Command: uv run ruff check server/sql_environment.py
    Result: All checks passed
    
    Command: uv run pytest tests/ -v
    Result: 21 failed in 4.62s
    Failure pattern: legacy smoke suite still targets pre-migration interfaces
    (system_prompt/action_description/messages/tokens) and is expected to be
    rewritten in Step 3.2.
    
    Command: uv run python scripts/download_spider_databases.py --db-id student_assessment
    Result: RuntimeError due to upstream Spider raw URL 404 in current downloader.
    
    Reviewer subagent verdict: APPROVE (Step 2.2 scope)
    
  • Tests run: uv run pytest tests/ -v
  • Notes:
    • Initial schema_info now lists table names only; described table column details are appended progressively.
    • Question selection uses random.Random(seed) when a seed is supplied for deterministic reset behavior.
    • Gold answer computation now runs through a timeout-protected, SELECT-only SQL path (_execute_gold_sql) on the read-only connection.
  • Issues: Local workspace currently lacks Spider SQLite fixtures; full reset runtime validation depends on Step 1.1 data download script path fix or local DB provisioning.
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • reset() works; step() handlers can now be implemented

Step 2.3: Implement _execute_sql and action handlers, rewrite step()

Slice: S2 Goal: Implement sandboxed SQL execution and all four action handlers. Rewrite step() to dispatch structured actions, enforce budget, and handle episode termination.

Files:

  • server/sql_environment.py - modify - Add _execute_sql(), _handle_describe(), _handle_sample(), _handle_query(), _handle_answer(). Rewrite step(). Remove _call_ollama_to_select_table(), _call_ollama_for_sql(), _detect_action_type(), _generate_sample_query(), _create_observation().

Interface Changes:

  • step(action, *, timeout_s, **kwargs) -> SQLObservation
  • _execute_sql(sql, timeout_s) -> list[tuple]
  • _handle_describe(table_name) -> str
  • _handle_sample(table_name, limit) -> str
  • _handle_query(sql) -> str
  • _handle_answer(value) -> tuple[bool, float]

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Medium

Processes untrusted SQL input; must enforce read-only + SELECT-only + timeout

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-24T21:10:26Z Changes Made:

  • server/sql_environment.py rewritten to implement _execute_sql (SELECT-only validation, single-statement guard, SQLite progress-handler timeout, 20-row truncation) and all structured handlers (_handle_describe, _handle_sample, _handle_query, _handle_answer).
  • server/sql_environment.py step() rewritten to dispatch on DESCRIBE/SAMPLE/QUERY/ANSWER, return observation-level errors instead of raising, enforce step budget/termination, and keep ANSWER as non-budget-consuming on valid submissions.
  • Removed legacy Ollama-era action helpers (_call_ollama_to_select_table, _call_ollama_for_sql, _generate_sample_query, _detect_action_type, _create_observation) and converted message_to_action() into a thin structured-action adapter.
  • Applied reviewer-requested hardening: invalid action types and empty arguments now consume budget/step count to prevent malformed-action budget bypass loops.

Result:

  • Outcome: ?? Completed with Issues
  • Evidence Captured:
    Command: uv run ruff check server/sql_environment.py
    Result: All checks passed
    
    Command: uv run pytest tests/ -v
    Result: 21 failed in 6.42s
    Failure pattern: legacy smoke suite still asserts pre-migration interfaces
    (system_prompt constructor, action_description/messages/tokens contract).
    
    Reviewer subagent verdict: REQUEST_CHANGES
    Reviewer finding addressed: malformed-action budget bypass fixed by charging
    invalid action type / empty-argument attempts against budget.
    
  • Tests run: uv run pytest tests/ -v
  • Notes:
    • _execute_sql should use connection.set_progress_handler(callback, N) for timeout
    • SELECT-only validation: strip, split on whitespace, check first token is SELECT
    • _handle_describe: use PRAGMA table_info(table_name) on live connection
    • _handle_sample: SELECT * FROM {table_name} LIMIT {limit} via _execute_sql
    • _handle_query: validate SELECT-only, execute, format, truncate to 20 rows
    • _handle_answer: simple string comparison (case-insensitive, stripped) for MVP
    • Budget decrement on DESCRIBE, SAMPLE, QUERY only (not ANSWER)
    • Refactor or remove message_to_action() -- keep as thin adapter if OpenEnv requires it
  • Issues: Legacy smoke tests remain out-of-date with post-1.3/2.x contracts and will be rewritten in Step 3.2.
  • Follow-ups Created: None
  • Human Review Completed: N/A (Medium risk but sandboxing is well-specified)

Context for Next Step:

  • Environment core loop now executes structured actions against SQLite with sandboxing; proceed to Step 3.1 to wire the new constructor and observation contract through server/app.py and client.py.

Step 3.1: Update app.py factory and client.py

Slice: S3 Goal: Wire the new constructor signature into the factory and update the client to handle rich observations.

Files:

  • server/app.py - modify - Update create_sql_environment() to pass questions_path, db_dir
  • client.py - modify - Update _parse_result() for new SQLObservation fields

Interface Changes:

  • create_sql_environment() passes new constructor params
  • Client handles question, schema_info, result, error, step_count, budget_remaining, action_history fields

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-24T21:17:18Z Changes Made:

  • server/app.py updated create_sql_environment() to remove legacy SYSTEM_PROMPT wiring and pass questions_path (QUESTIONS_PATH env var with project default) and db_dir (DB_DIR env var with project default) into SQLEnvironment.
  • client.py updated _step_payload() to send the structured wire contract (action_type, argument) instead of legacy action_description/tokens.
  • client.py updated _parse_result() to deserialize rich SQLObservation fields (question, schema_info, result, error, step_count, budget_remaining, action_history) with robust fallback when observation is absent.
  • client.py message_to_action() updated to emit structured SQLAction(action_type, argument) and support explicit prefixed actions (DESCRIBE, SAMPLE, QUERY, ANSWER).

Result:

  • Outcome: ?? Completed with Issues
  • Evidence Captured:
    Command: uv run ruff check server/app.py client.py
    Result: All checks passed
    
    Command: uv run pytest tests/ -v
    Result: 21 failed in 6.62s
    Failure pattern: legacy smoke suite still asserts pre-migration contracts
    (system_prompt constructor, action_description/messages/tokens fields).
    
    Reviewer subagent verdict: APPROVE
    Reviewer note: targeted contract checks for step payload parsing and app
    factory wiring passed for Step 3.1 scope.
    
  • Tests run: uv run ruff check server/app.py client.py, uv run pytest tests/ -v
  • Notes:
    • Use env vars QUESTIONS_PATH and DB_DIR with sensible defaults
    • Remove system_prompt env var from factory (no longer needed)
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Rewrite tests/test_smoke.py for the structured action/observation contract and new environment constructor to clear the current legacy-suite failures.

Step 3.2: Rewrite tests

Slice: S3 Goal: Update test_smoke.py for structured actions, real SQL execution, and rich observations. Remove tests for Ollama-based methods.

Files:

  • tests/test_smoke.py - modify - Rewrite test classes for new interface

Interface Changes: None (tests only)

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-24T21:27:31Z Changes Made:

  • tests/test_smoke.py fully rewritten from legacy chat/token contract tests to structured action loop coverage for SQLAction.argument and rich SQLObservation fields.
  • Added deterministic temp SQLite + questions fixtures used by environment lifecycle tests (reset, DESCRIBE, SAMPLE, QUERY, ANSWER, budget exhaustion, post-terminal behavior).
  • Added sandbox behavior assertions for SELECT-only rejection, query truncation to 20 rows, timeout-path error handling, and read-only DB enforcement.
  • Updated client-contract tests for _step_payload(), _parse_result(), _parse_state(), and client message_to_action() inference.

Result:

  • Outcome: OK Fully Successful
  • Evidence Captured:
    Command: uv run pytest tests/ -v
    Result: 25 passed in 6.49s
    
    Coverage notes:
    - Structured action contract (DESCRIBE/SAMPLE/QUERY/ANSWER) validated
    - Rich observation fields validated on reset/step
    - SQL sandbox guards covered (non-SELECT rejection, timeout path, read-only)
    - Step budget and terminal behavior covered
    
  • Tests run: uv run pytest tests/ -v
  • Notes:
    • Replaced legacy Ollama/message-token assumptions with tests aligned to the current environment architecture
    • Tests use local temporary fixtures and do not require Spider database downloads
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • All implementation steps complete and verified; ready for commit/push/PR workflow

8. Rollout Considerations

Feature Flags

  • Required: No
  • Flag name: N/A

Migration

  • Data migration needed: No
  • Migration strategy: N/A

The Spider database download is a one-time setup step via scripts/download_spider_databases.py.

Rollback Plan

Revert the feature branch. The environment returns to the Ollama-based non-functional state. No data migration is involved.


9. Execution Tracking

All execution state is tracked within this document:

  • Section 1a: Overall progress summary
  • Section 7: Per-step completion details, test results, and handoff context
  • FEATURES.json: Feature-level status/progress metadata used by /autocode-next-step and opencode-ctx ralph run
  • Git history: Full audit trail of changes to this file

The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:

  • Checking Section 1a for summary
  • Reviewing Section 7 for detailed step status
  • Inspecting the feature's progress and status fields in FEATURES.json
  • Running git log --oneline IMPLEMENTATION_SPEC.md for change history

9a. Slice Completion Protocol

After all steps in a slice pass verification:

  1. Run verifier subagent for spec compliance

    • Validates against VERIFICATION_SPEC.md criteria
    • Ensures no TODOs or incomplete work in slice
  2. Run compound-engineer subagent to extract learnings

    • Mandatory invocation after every slice completion
    • Updates CLAUDE.md Learnings section (if durable patterns found)
    • May exit with "no update needed" (valid for routine work)
  3. Commit the slice changes

    • Follow commit message format in CLAUDE.md
    • Each slice gets its own atomic commit
  4. Continue to next slice (if more slices remain)

    • Or proceed to final verification if all slices complete

Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.


10. User Value Summary

Status: Generated

What Users Can Now Do

Agents can now run full SQL exploration episodes end-to-end: reset into a real question/database pair, inspect schema with DESCRIBE/SAMPLE, execute SELECT queries safely, and submit terminal ANSWER actions for reward.

How to Access/Test

Run uv run pytest tests/ -v for automated coverage, or start the environment with uv run uvicorn server.app:app --reload and call /reset then /step using structured actions (DESCRIBE, SAMPLE, QUERY, ANSWER).

Demo

  • Command: uv run pytest tests/ -v

Release Notes Snippet

Implemented the core SQL environment loop with structured actions, live read-only SQLite execution, step-budget termination, and updated client/test contracts.


11. PR Contract (Auto-Generated by autocode-next-step)

Status: Generated

Scope

  • Complete F001 core environment loop migration from Ollama-driven behavior to deterministic structured SQL execution.
  • Include model, server loop, app/client wiring, and rewritten smoke coverage aligned to the new wire contract.

Verification

  • uv run pytest tests/ -v -> 25 passed, 0 failed.
  • Verification mode: standard.

Risk / Rollback

  • Risk tier: Medium (untrusted SQL input), mitigated via read-only DB mode, SELECT-only validation, and timeout enforcement.
  • Rollback: revert feature branch commits for F001.

Ready For

  • /commit-push-pr

PR Created


Stop Conditions (When to Split This Spec)

Stop and create a new IMPLEMENTATION_SPEC if:

  • A step requires touching more than 3 files in unrelated areas
  • You need to introduce multiple new abstractions "just in case"
  • Verification cannot be made targeted and concrete
  • You discover new unknowns that change the plan materially
  • The next slice cannot be merged safely without finishing later slices

When splitting, ensure the current slice ends in a merged, stable state.


Human Checkpoint

Before handing to AI agent:

  • Interface specifications are complete
  • Data flow is accurate
  • Error handling is specified
  • Implementation order makes sense
  • VERIFICATION_SPEC.md has been generated

Questions:

  1. Any remaining concerns?
  2. Anything agent should know?

Handoff Notes

For the implementing AI agent:

Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly

Specification completed: 2026-03-24 Approved by: [NAME/ROLE] Verification spec: VERIFICATION_SPEC.md Target agent: Claude Code