Implementation Specification
Change: F001 - Core Environment Loop (step/reset lifecycle with structured actions, SQLite execution, sandboxing, question loading, step budget) Date: 2026-03-24 Research Summary: specs/F001-RESEARCH_SUMMARY.md Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner) Behavior Delta: Archived in specs/behavior/sql-environment.md
Plan Status:
- Draft
- Approved for Implementation
- Implementation Complete
- Verification Passed
Core Intent (Immutable)
DO NOT MODIFY THIS SECTION DURING REFINEMENT Changes to Core Intent mean you're describing a different feature. If refinement reveals the need to change this section, create a new feature instead.
User Problem: Agents can play complete episodes: reset with a random question, explore a hidden schema via DESCRIBE/SAMPLE, run SQL queries, and submit answers. Currently SQL never executes -- this makes the environment actually functional.
Success Criteria:
- Agent sends DESCRIBE employees and immediately sees column names and types
- Queries execute in <100ms with clean truncated output (max 20 rows)
- Bad SQL returns a clear error message the agent can learn from
- Episode ends cleanly when budget exhausted or ANSWER submitted
Avoid:
- Environment calling Ollama to interpret actions -- agent should own reasoning, env should just execute
- Queries hanging or crashing the environment
- Opaque error messages that don't help the agent adjust
Out of Scope:
- Advanced reward computation (Phase 3 --
server/reward.pystub) - Answer verification beyond simple string comparison (Phase 2 --
server/verifier.pystub) - Synthetic data generation for databases
- Multi-database episode support (single db per episode)
- Token/message history management (existing OpenEnv pattern, not touched)
0. Slicing & Scope Budget (Anti-Waterfall)
This spec must be executable in small, mergeable increments.
Scope Budget
- Target: 3 slices
- Hard max: <= 10 steps total
- Each step must end in: implement -> verify -> merge
Slice Definition
A slice is a vertical increment that delivers user-visible value or a safe internal capability.
Each slice must have:
- Clear outcome
- Minimal interface change
- Merge criteria
Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).
Status Icons
Step Status:
- !! Not Started
- :: In Progress
- OK Completed
- XX Blocked/Failed
Result Outcome:
- OK Fully Successful (all tests passed, no issues)
- ?? Completed with Issues (needs follow-up)
- XX Failed/Blocked
1. Implementation Overview
Summary
Replace the non-functional Ollama-based step/reset lifecycle with a working environment loop. Download Spider SQLite databases for real SQL execution. Rewrite models.py to use structured SQLAction (with argument field replacing action_description) and rich SQLObservation (with question, schema_info, result, error, step_count, budget_remaining, action_history). Implement EpisodeContext and QuestionRecord as server-side dataclasses. Wire reset() to pick a random question, open a read-only SQLite connection, compute the gold answer, and return an initial observation. Wire step() to dispatch structured actions to _handle_describe, _handle_sample, _handle_query, and _handle_answer handlers. Implement sandboxed SQL execution (_execute_sql) with SELECT-only validation, read-only connection, 5s timeout, and 20-row truncation. Enforce a 15-step budget. Update server/app.py factory and client.py to match the new interfaces. Remove Ollama dependency entirely.
Scope
In Scope:
- Download Spider SQLite databases via script
QuestionRecordandEpisodeContextdataclasses inmodels.py- Rewrite
SQLActionwithargumentfield (replacingaction_description) - Uncomment and populate rich
SQLObservationfields SQLEnvironment.__init__withquestions_path,db_dir,step_budgetparamsreset()with question selection, DB connection, gold answer computationstep()dispatching to four action handlers_execute_sql()with sandboxing (read-only, SELECT-only, timeout, truncation)_handle_describe(),_handle_sample(),_handle_query(),_handle_answer()_build_observation()constructing rich observations_load_questions()and_open_db()infrastructure- Update
server/app.pyfactory function - Update
client.pyfor new observation fields - Remove
_call_ollama_to_select_table(),_call_ollama_for_sql(),_detect_action_type() - Refactor or remove
message_to_action()(thin adapter if required by OpenEnv)
Out of Scope:
server/reward.pyimplementation (Phase 3)server/verifier.pyimplementation beyond simple string comparison (Phase 2)- WebSocket-specific changes (OpenEnv handles this via
create_app) - Token history management changes
1a. Execution Status
Progress: 8/8 steps complete Current Step: Finalization complete (verification passed) Last Updated: 2026-03-24T21:27:31Z Latest Result: OK Fully Successful (Step 3.2 completed; rewritten smoke suite validates structured action loop and all tests are green) Blockers: None
1b. Risk Assessment
Risk Tier: Medium
Risk Tier Definitions:
- Low: Pure logic, non-user-facing, no security implications
- Medium: User input handling, data validation, API changes
- High: Authentication, payments, secrets management, untrusted input
High-Risk Indicators Present: (check all that apply if tier is High)
- Touches authentication or authorization logic
- Handles payment processing or financial data
- Manages secrets, API keys, or credentials
- Processes untrusted user input (file uploads, external APIs)
- Modifies privilege/permission systems
Security Review Required: No
Justification: Agent-provided SQL is untrusted input, but mitigated by read-only SQLite connections, SELECT-only validation, and query timeout. No authentication, secrets, or payment logic involved. The SQL injection surface is intentionally constrained to read-only SELECT queries on a local SQLite file.
2. Change Manifest
Files to Create
| File | Purpose |
|---|---|
scripts/download_spider_databases.py |
Script to download Spider SQLite database files from the Spider dataset |
Files to Modify
| File | Changes |
|---|---|
models.py |
Rewrite SQLAction (add argument, remove action_description). Uncomment rich SQLObservation fields. Add EpisodeContext, QuestionRecord dataclasses. Update SQLState. |
server/sql_environment.py |
Complete rewrite of __init__, reset(), step(). Add _execute_sql, _handle_describe, _handle_sample, _handle_query, _handle_answer, _build_observation, _load_questions, _open_db. Remove Ollama methods. Refactor message_to_action. |
server/app.py |
Update create_sql_environment() factory to pass questions_path and db_dir |
client.py |
Update _parse_result() to handle rich SQLObservation fields |
tests/test_smoke.py |
Rewrite tests for new structured action interface and SQL execution |
Files to Delete
| File | Reason |
|---|---|
| (none) | No files deleted; Ollama methods removed from sql_environment.py inline |
3. Interface Specifications
New Types
# Location: models.py
from dataclasses import dataclass, field
import sqlite3
@dataclass
class QuestionRecord:
"""One question from the Spider dataset."""
question_id: str
question_text: str
database_name: str
gold_sql: str
gold_answer: str # Computed at load or reset by running gold_sql
answer_type: str # "integer" | "float" | "string" | "list"
difficulty: str # "easy" | "medium" | "hard"
tables_involved: list[str]
@dataclass
class EpisodeContext:
"""Per-episode server-side state (never sent to agent)."""
episode_id: str
db_connection: sqlite3.Connection
question_record: QuestionRecord
step_count: int = 0
budget: int = 15
described_tables: set[str] = field(default_factory=set)
action_log: list[str] = field(default_factory=list)
done: bool = False
gold_answer: str | None = None # Computed at reset by running gold_sql
Modified Types
# Location: models.py
# CHANGE: Replace action_description with argument; add ANSWER action type
class SQLAction(Action):
"""Structured action from agent to environment."""
action_type: str = Field(
..., description="One of: DESCRIBE, SAMPLE, QUERY, ANSWER"
)
argument: str = Field(
..., description="Table name (DESCRIBE/SAMPLE), SQL string (QUERY), or answer value (ANSWER)"
)
# REMOVED: action_description, tokens
# Location: models.py
# CHANGE: Uncomment rich observation fields, remove messages/tokens
class SQLObservation(Observation):
"""Rich observation from environment to agent."""
# Inherited: done (bool), reward (float | None)
question: str = Field(..., description="The NL question to answer")
schema_info: str = Field(..., description="Known schema info (table names initially)")
result: str = Field(default="", description="Result of last action (truncated)")
error: str = Field(default="", description="Error message if action failed")
step_count: int = Field(default=0, description="Current step number")
budget_remaining: int = Field(default=0, description="Steps left")
action_history: list[str] = Field(
default_factory=list, description="Summary of previous actions"
)
New Functions
# Location: server/sql_environment.py
class SQLEnvironment(Environment[SQLAction, SQLObservation, SQLState]):
def __init__(
self,
questions_path: str,
db_dir: str,
tokenizer: ModelTokenizer,
step_budget: int = 15,
):
"""Initialize with path to questions JSON and database directory.
Args:
questions_path: Path to Spider questions JSON file
db_dir: Directory containing Spider SQLite database files
tokenizer: ModelTokenizer for OpenEnv compatibility
step_budget: Maximum steps per episode (default 15)
"""
def _load_questions(self, path: str) -> list[QuestionRecord]:
"""Load and parse question JSON into QuestionRecord list.
Args:
path: Path to questions JSON file (Spider format)
Returns:
List of QuestionRecord objects
Raises:
FileNotFoundError: If questions file does not exist
ValueError: If JSON format is invalid
"""
def _open_db(self, db_name: str) -> sqlite3.Connection:
"""Open read-only SQLite connection for a Spider database.
Args:
db_name: Database name (matches db_id in questions JSON)
Returns:
Read-only sqlite3.Connection
Raises:
FileNotFoundError: If database file does not exist
"""
def _execute_sql(self, sql: str, timeout_s: float = 5.0) -> list[tuple]:
"""Sandboxed SQL execution: read-only, timeout, SELECT-only.
Args:
sql: SQL query to execute
timeout_s: Maximum execution time in seconds
Returns:
List of result tuples
Raises:
ValueError: If SQL is not a SELECT statement
sqlite3.OperationalError: If query fails or times out
"""
def _handle_describe(self, table_name: str) -> str:
"""Return column names, types, row count for table.
Args:
table_name: Name of the table to describe
Returns:
Formatted string with column info, or error message if table not found
"""
def _handle_sample(self, table_name: str, limit: int = 5) -> str:
"""Execute SELECT * FROM table LIMIT N, return formatted rows.
Args:
table_name: Name of the table to sample
limit: Maximum rows to return (default 5)
Returns:
Formatted string with sample data, or error message if table not found
"""
def _handle_query(self, sql: str) -> str:
"""Validate SELECT-only, execute with timeout, truncate to 20 rows.
Args:
sql: SQL SELECT query to execute
Returns:
Formatted result string, or error message
"""
def _handle_answer(self, value: str) -> tuple[bool, float]:
"""Compare to gold answer, return (correct, reward).
Args:
value: Agent's answer string
Returns:
Tuple of (is_correct, reward_value)
"""
def _build_observation(self) -> SQLObservation:
"""Construct SQLObservation from current episode context.
Returns:
Rich SQLObservation with question, schema, result, error, budget info
"""
Modified Functions
# Location: server/sql_environment.py
# CHANGE: New constructor signature with questions_path, db_dir, step_budget
def __init__(
self,
questions_path: str, # NEW
db_dir: str, # NEW
tokenizer: ModelTokenizer,
step_budget: int = 15, # NEW
):
"""Initialize with question dataset and database paths."""
# Location: server/sql_environment.py
# CHANGE: reset() now picks question, opens DB, computes gold answer
def reset(
self,
*,
seed: int | None = None,
episode_id: str | None = None,
**kwargs,
) -> SQLObservation:
"""Pick random question, open read-only SQLite, return initial observation."""
# Location: server/sql_environment.py
# CHANGE: step() now dispatches structured actions, executes SQL
def step(
self,
action: SQLAction,
*,
timeout_s: float = 30,
**kwargs,
) -> SQLObservation:
"""Dispatch to handler, update episode context, return observation."""
# Location: server/app.py
# CHANGE: Factory passes questions_path and db_dir
def create_sql_environment():
"""Factory function that creates SQLEnvironment with tokenizer and data paths."""
tokenizer = get_tokenizer()
questions_path = os.environ.get(
"QUESTIONS_PATH",
str(Path(__file__).parent.parent / "data" / "questions" / "student_assessment.json"),
)
db_dir = os.environ.get(
"DB_DIR",
str(Path(__file__).parent.parent / "data" / "databases"),
)
return SQLEnvironment(
questions_path=questions_path,
db_dir=db_dir,
tokenizer=tokenizer,
)
API Changes
The HTTP/WebSocket API is defined by OpenEnv's create_app() and does not change structurally. The payload shapes change:
# Endpoint: POST /step
# CHANGE: SQLAction now uses argument instead of action_description
Request:
action_type: str # "DESCRIBE" | "SAMPLE" | "QUERY" | "ANSWER"
argument: str # table name, SQL, or answer value
Response (SQLObservation):
done: bool
reward: float | null
question: str
schema_info: str
result: str
error: str
step_count: int
budget_remaining: int
action_history: list[str]
# Endpoint: POST /reset
# CHANGE: Now returns rich observation with question and schema
Response (SQLObservation):
done: false
reward: null
question: str # The NL question for this episode
schema_info: str # Table names only (columns hidden until DESCRIBE)
result: ""
error: ""
step_count: 0
budget_remaining: 15
action_history: []
4. Data Flow
Primary Flow: Reset
1. Client calls POST /reset
- Input: optional seed, episode_id
2. SQLEnvironment.reset()
- Close previous EpisodeContext.db_connection (if exists)
- Pick random QuestionRecord from loaded questions (using seed if provided)
- Open read-only SQLite via _open_db(question.database_name)
- Execute gold_sql to compute gold_answer
- Create new EpisodeContext (step_count=0, budget=15, done=False)
3. _build_observation()
- Output: SQLObservation with question text, table names as schema_info,
empty result/error, step_count=0, budget_remaining=15, empty action_history
Primary Flow: Step (QUERY)
1. Client calls POST /step with SQLAction(action_type="QUERY", argument="SELECT ...")
- Input: structured action
2. SQLEnvironment.step(action)
- Validate action_type is one of DESCRIBE/SAMPLE/QUERY/ANSWER
- Check episode not done and budget > 0
- Dispatch to _handle_query(sql)
3. _handle_query(sql)
- Validate SQL starts with SELECT (case-insensitive, after stripping)
- Call _execute_sql(sql, timeout_s=5.0)
- Format results as text table, truncate to 20 rows
- Return formatted result string
4. Update EpisodeContext
- step_count += 1
- budget -= 1
- Append action summary to action_log
- If budget == 0: done = True
5. _build_observation()
- Output: SQLObservation with result, updated step_count/budget
Alternative Flows
When action_type is DESCRIBE:
1. _handle_describe(table_name)
2. If table_name not in database tables -> return error string listing available tables
3. Query sqlite_master or PRAGMA table_info for column names/types
4. Add table to described_tables set
5. Return formatted schema string
When action_type is SAMPLE:
1. _handle_sample(table_name, limit=5)
2. If table_name not in database tables -> return error string
3. Execute "SELECT * FROM {table_name} LIMIT 5" via _execute_sql
4. Return formatted rows
When action_type is ANSWER:
1. _handle_answer(value)
2. Compare value to gold_answer (case-insensitive string comparison for MVP)
3. Set done = True
4. Return (is_correct, 1.0 if correct else 0.0)
5. Do NOT decrement budget for ANSWER actions
When budget is exhausted:
1. Budget reaches 0 after step
2. Set done = True, reward = 0.0
3. Return terminal observation with done=True
When SQL is invalid:
1. _handle_query receives non-SELECT SQL
2. Return error: "Only SELECT queries are allowed. Got: {first_word}"
3. Step still counts against budget
When SQL times out:
1. _execute_sql exceeds 5s timeout
2. Interrupt query via progress_handler
3. Return error: "Query timed out after 5.0 seconds"
When step() called after episode is done:
1. Check self._episode.done is True
2. Return current observation unchanged (no state mutation)
5. Error Handling
Error Types
| Error | When | Response | User Message |
|---|---|---|---|
| Invalid action_type | action_type not in {DESCRIBE, SAMPLE, QUERY, ANSWER} | error field in observation | "Unknown action type '{x}'. Valid types: DESCRIBE, SAMPLE, QUERY, ANSWER" |
| Table not found | DESCRIBE/SAMPLE with nonexistent table | error field in observation | "Table '{x}' not found. Available tables: {list}" |
| Non-SELECT SQL | QUERY with INSERT/UPDATE/DELETE/etc. | error field in observation | "Only SELECT queries are allowed. Got: {first_keyword}" |
| SQL syntax error | Invalid SQL | error field in observation | "SQL error: {sqlite3_error_message}" |
| Query timeout | Execution exceeds 5s | error field in observation | "Query timed out after 5.0 seconds" |
| Empty argument | Blank argument field | error field in observation | "Argument cannot be empty for {action_type}" |
| Episode already done | step() after termination | Return current obs | (no error -- observation unchanged, done=True) |
| Database file missing | _open_db can't find .sqlite | FileNotFoundError at reset | "Database '{db_name}' not found in {db_dir}" |
| Questions file missing | _load_questions can't find JSON | FileNotFoundError at init | "Questions file not found: {path}" |
Error Handling Strategy
All action-level errors are returned in the error field of SQLObservation. The environment never raises exceptions from step() -- errors are part of the observation so the agent can learn from them.
Infrastructure errors (missing database files, missing questions file) raise Python exceptions at init/reset time since these are configuration failures, not agent errors.
# Pattern for action handlers:
def _handle_query(self, sql: str) -> str:
sql_stripped = sql.strip()
if not sql_stripped:
return "" # error set in step()
# SELECT-only check
first_word = sql_stripped.split()[0].upper()
if first_word != "SELECT":
return "" # error set in step()
try:
rows = self._execute_sql(sql_stripped)
return self._format_results(rows)
except sqlite3.OperationalError as e:
return "" # error message set in step()
Retry Strategy
| Operation | Retry? | Strategy |
|---|---|---|
| SQL execution | No | Single attempt; timeout kills long queries |
| DB connection open | No | Fail fast at reset(); configuration error |
| Question loading | No | Fail fast at init; file must exist |
6. Slice Plan (What we will ship, in order)
Slice S1 -- Data & Types Foundation
Value: Database files exist, models are updated, environment can be instantiated with new constructor
User-visible change: No (internal foundation)
Interfaces introduced/changed: SQLAction.argument, rich SQLObservation, EpisodeContext, QuestionRecord, new __init__ signature
Rollback safety: Additive -- new fields on models, old code paths not yet removed
Slice S2 -- Core Environment Loop
Value: reset() picks questions and opens databases; step() dispatches to handlers and executes real SQL; episodes run end-to-end
User-visible change: Yes -- the environment is now functional
Interfaces introduced/changed: reset(), step(), all _handle_* methods, _execute_sql, _build_observation
Rollback safety: Replaces existing broken Ollama-based methods; rollback = revert commit
Slice S3 -- Integration & Cleanup
Value: Factory, client, and tests updated; Ollama code removed; environment fully wired
User-visible change: Yes -- complete end-to-end agent interaction works
Interfaces introduced/changed: create_sql_environment() factory, client _parse_result()
Rollback safety: Final cleanup slice; rollback = revert commit
7. Implementation Steps
VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md. The verification-planner (separate agent) generated independent test criteria. Run the tests specified there after implementing each step.
Step 1.1: Download Spider SQLite Databases
Slice: S1 Goal: Create a script that downloads the actual Spider SQLite database files so the environment has real data to query.
Files:
scripts/download_spider_databases.py- create - Download script that fetches Spider database .sqlite filesdata/databases/- modified - Will contain downloaded .sqlite files (gitignored)
Interface Changes: None (infrastructure only)
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-24T19:22:08Z Changes Made:
scripts/download_spider_databases.pycreated as a CLI utility to download one Spider SQLite database (--db-id) or all databases (--db-id all) intodata/databases/- Added argument parsing (
--db-id,--output-dir,--force) and reusable download helpers for raw single-file and archive-based bulk download - Added input/path hardening:
db_idvalidation ([A-Za-z0-9_]+) and safe output-path boundary enforcement to prevent path traversal writes
Result:
- Outcome: OK Fully Successful
- Evidence Captured:
Command: uv run python scripts/download_spider_databases.py --help Result: CLI usage printed successfully with expected options Command: uv run python scripts/download_spider_databases.py --db-id "../bad" Result: ValueError raised as expected for invalid db_id Command: uv run pytest tests/ -v Result: 21 passed in 4.73s Reviewer subagent verdict: APPROVE - Tests run:
uv run pytest tests/ -v - Notes:
- Script should download the
student_assessmentdatabase at minimum - Spider databases are typically at
https://github.com/taoyds/spideror HuggingFace - The
student_assessment.sqlitemust match the ORM models indata/databases/models.py
- Script should download the
- Issues: Legacy environment/client/test code still targets removed wire fields (
action_description,messages,tokens); resolved by planned S2/S3 steps. - Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Database file(s) must exist at
data/databases/student_assessment/student_assessment.sqlitebefore reset() can work
Step 1.2: Add QuestionRecord and EpisodeContext to models.py
Slice: S1 Goal: Implement the server-side dataclasses that hold per-episode state and question metadata.
Files:
models.py- modify - AddQuestionRecordandEpisodeContextdataclasses
Interface Changes:
- New
QuestionRecorddataclass - New
EpisodeContextdataclass
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-24T19:26:22Z Changes Made:
models.pyupdated to addQuestionRecorddataclass with the full 8-field question metadata contract.models.pyupdated to addEpisodeContextdataclass with server-side episode state, including safe mutable defaults fordescribed_tablesandaction_log.- Added dataclass/sqlite imports and aliased dataclass
fieldtodataclass_fieldto avoid conflicts with PydanticField.
Result:
- Outcome: OK Fully Successful
- Evidence Captured:
Command: uv run pytest tests/ -v Result: 21 passed in 4.70s Reviewer subagent verdict: APPROVE - Tests run:
uv run pytest tests/ -v - Notes:
- Keep conceptual comments in models.py for reference but implement the actual dataclasses
EpisodeContext.db_connectionissqlite3.Connection-- not serializable, server-only
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
QuestionRecordandEpisodeContextnow exist as concrete server-side types; proceed to wire-level model rewrite in Step 1.3 (SQLAction.argumentand richSQLObservationfields).
Step 1.3: Rewrite SQLAction and SQLObservation
Slice: S1
Goal: Update wire types to use structured argument field and rich observation fields.
Files:
models.py- modify - RewriteSQLAction(replaceaction_descriptionwithargument, removetokens), uncomment and updateSQLObservationrich fields, removemessages/tokens
Interface Changes:
SQLAction.action_description->SQLAction.argumentSQLAction.tokensremovedSQLObservationgains:question,schema_info,result,error,step_count,budget_remaining,action_historySQLObservation.messagesandSQLObservation.tokensremoved
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Medium
Breaking API change -- client must be updated in S3
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-24T19:32:08Z Changes Made:
models.pyupdated to replaceSQLAction.action_descriptionwithSQLAction.argument, and removeSQLAction.tokensfrom the wire contract.models.pyupdated to replace legacySQLObservation.messages/tokenspayload shape with rich observation fields:question,schema_info,result,error,step_count,budget_remaining,action_history.models.pyupdatedSQLState.current_action_typedefault/description to align with normalized action vocabulary (DESCRIBE,SAMPLE,QUERY,ANSWER).
Result:
- Outcome: ?? Completed with Issues
- Evidence Captured:
Command: uv run pytest tests/ -v Result: 15 failed, 6 passed in 5.30s Failure pattern: expected legacy contract mismatch in tests/environment/client still using action_description/messages/tokens. This is expected after Step 1.3 wire-model rewrite and will be resolved by the planned S2/S3 environment/client/test rewrites. Reviewer subagent verdict: APPROVE - Tests run:
uv run pytest tests/ -v - Notes:
- This is a breaking change to the wire protocol
- Existing tests fail after this step until Step 2.x/3.x updates environment/client/tests to the new contract
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Wire contracts are now in place; next step is Step 2.1 to rewrite environment constructor and data loading/open-db infrastructure to match the new model interfaces.
Step 2.1: Rewrite SQLEnvironment constructor, _load_questions, _open_db
Slice: S2 Goal: New constructor that accepts questions_path and db_dir, loads questions at init, and provides _open_db for reset.
Files:
server/sql_environment.py- modify - Rewrite__init__, add_load_questions(), add_open_db()
Interface Changes:
SQLEnvironment.__init__(questions_path, db_dir, tokenizer, step_budget)replaces old constructor- New
_load_questions(path) -> list[QuestionRecord] - New
_open_db(db_name) -> sqlite3.Connection
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-24T19:44:22Z Changes Made:
server/sql_environment.pyconstructor rewritten to requirequestions_path,db_dir,tokenizer, andstep_budget, with validation for missing paths and non-positive step budgets.- Added
_load_questions(path)to parse Spider question JSON intoQuestionRecordvalues with schema-safedb_idvalidation and derivedtables_involvedfromFROM/JOINclauses. - Added
_open_db(db_name)read-only opener usingfile:{path}?mode=ro, with db-name allowlist validation and resolved-path containment checks to prevent path traversal outsidedb_dir. - Removed runtime dependency on Ollama HTTP calls in helper methods to keep this step local and deterministic while Step 2.3 rewires query execution fully.
Result:
- Outcome: ?? Completed with Issues
- Evidence Captured:
Command: uv run ruff check server/sql_environment.py Result: All checks passed Command: uv run pytest tests/ -v Result: 21 failed in 4.97s Failure pattern: expected legacy smoke suite mismatch (tests still assert old constructor and wire contract: system_prompt/action_description/messages/tokens). Reviewer subagent verdict: APPROVE Reviewer notes: previously reported _open_db path-traversal risk resolved via db_name allowlist + resolved path containment checks. - Tests run:
uv run pytest tests/ -v - Notes:
- Remove Ollama config (ollama_model, ollama_base_url)
- Remove
requestsimport - Keep
self.db_modelsdict for_handle_describefallback but preferPRAGMA table_infoon the live SQLite connection _open_dbopens with URIfile:{path}?mode=ro
- Issues: Legacy smoke tests still target pre-S2 interfaces and will be rewritten in Step 3.2.
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Constructor, question loading, and DB opening are ready with path/input guards; proceed to Step 2.2 to implement reset lifecycle and rich observation building.
Step 2.2: Implement reset() and _build_observation()
Slice: S2
Goal: reset() picks a random question, opens the database, computes the gold answer, creates EpisodeContext, and returns the initial observation via _build_observation().
Files:
server/sql_environment.py- modify - Rewritereset(), add_build_observation()
Interface Changes:
reset(*, seed, episode_id, **kwargs) -> SQLObservation_build_observation() -> SQLObservation
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-24T19:54:13Z Changes Made:
server/sql_environment.pyreset lifecycle rewritten to select a question deterministically with optional seed, close any previous episode connection, open a read-only SQLite DB, compute the question gold answer, and initializeEpisodeContextwith configured budget and optionalepisode_id.- Added
_build_observation()to construct richSQLObservationpayloads from live episode context, including question text, schema table listing, budget/step counters, action history, and progressive described-table column info. - Added reset support helpers
_get_table_names(),_format_gold_answer(), and_execute_gold_sql()(SELECT-only + timeout guarded) plus a temporary_create_observation()wrapper for compatibility until Step 2.3 rewritesstep().
Result:
- Outcome: ?? Completed with Issues
- Evidence Captured:
Command: uv run ruff check server/sql_environment.py Result: All checks passed Command: uv run pytest tests/ -v Result: 21 failed in 4.62s Failure pattern: legacy smoke suite still targets pre-migration interfaces (system_prompt/action_description/messages/tokens) and is expected to be rewritten in Step 3.2. Command: uv run python scripts/download_spider_databases.py --db-id student_assessment Result: RuntimeError due to upstream Spider raw URL 404 in current downloader. Reviewer subagent verdict: APPROVE (Step 2.2 scope) - Tests run:
uv run pytest tests/ -v - Notes:
- Initial
schema_infonow lists table names only; described table column details are appended progressively. - Question selection uses
random.Random(seed)when a seed is supplied for deterministic reset behavior. - Gold answer computation now runs through a timeout-protected, SELECT-only SQL path (
_execute_gold_sql) on the read-only connection.
- Initial
- Issues: Local workspace currently lacks Spider SQLite fixtures; full reset runtime validation depends on Step 1.1 data download script path fix or local DB provisioning.
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- reset() works; step() handlers can now be implemented
Step 2.3: Implement _execute_sql and action handlers, rewrite step()
Slice: S2 Goal: Implement sandboxed SQL execution and all four action handlers. Rewrite step() to dispatch structured actions, enforce budget, and handle episode termination.
Files:
server/sql_environment.py- modify - Add_execute_sql(),_handle_describe(),_handle_sample(),_handle_query(),_handle_answer(). Rewritestep(). Remove_call_ollama_to_select_table(),_call_ollama_for_sql(),_detect_action_type(),_generate_sample_query(),_create_observation().
Interface Changes:
step(action, *, timeout_s, **kwargs) -> SQLObservation_execute_sql(sql, timeout_s) -> list[tuple]_handle_describe(table_name) -> str_handle_sample(table_name, limit) -> str_handle_query(sql) -> str_handle_answer(value) -> tuple[bool, float]
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Medium
Processes untrusted SQL input; must enforce read-only + SELECT-only + timeout
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-24T21:10:26Z Changes Made:
server/sql_environment.pyrewritten to implement_execute_sql(SELECT-only validation, single-statement guard, SQLite progress-handler timeout, 20-row truncation) and all structured handlers (_handle_describe,_handle_sample,_handle_query,_handle_answer).server/sql_environment.pystep()rewritten to dispatch onDESCRIBE/SAMPLE/QUERY/ANSWER, return observation-level errors instead of raising, enforce step budget/termination, and keepANSWERas non-budget-consuming on valid submissions.- Removed legacy Ollama-era action helpers (
_call_ollama_to_select_table,_call_ollama_for_sql,_generate_sample_query,_detect_action_type,_create_observation) and convertedmessage_to_action()into a thin structured-action adapter. - Applied reviewer-requested hardening: invalid action types and empty arguments now consume budget/step count to prevent malformed-action budget bypass loops.
Result:
- Outcome: ?? Completed with Issues
- Evidence Captured:
Command: uv run ruff check server/sql_environment.py Result: All checks passed Command: uv run pytest tests/ -v Result: 21 failed in 6.42s Failure pattern: legacy smoke suite still asserts pre-migration interfaces (system_prompt constructor, action_description/messages/tokens contract). Reviewer subagent verdict: REQUEST_CHANGES Reviewer finding addressed: malformed-action budget bypass fixed by charging invalid action type / empty-argument attempts against budget. - Tests run:
uv run pytest tests/ -v - Notes:
_execute_sqlshould useconnection.set_progress_handler(callback, N)for timeout- SELECT-only validation: strip, split on whitespace, check first token is SELECT
_handle_describe: usePRAGMA table_info(table_name)on live connection_handle_sample:SELECT * FROM {table_name} LIMIT {limit}via_execute_sql_handle_query: validate SELECT-only, execute, format, truncate to 20 rows_handle_answer: simple string comparison (case-insensitive, stripped) for MVP- Budget decrement on DESCRIBE, SAMPLE, QUERY only (not ANSWER)
- Refactor or remove
message_to_action()-- keep as thin adapter if OpenEnv requires it
- Issues: Legacy smoke tests remain out-of-date with post-1.3/2.x contracts and will be rewritten in Step 3.2.
- Follow-ups Created: None
- Human Review Completed: N/A (Medium risk but sandboxing is well-specified)
Context for Next Step:
- Environment core loop now executes structured actions against SQLite with sandboxing; proceed to Step 3.1 to wire the new constructor and observation contract through
server/app.pyandclient.py.
Step 3.1: Update app.py factory and client.py
Slice: S3 Goal: Wire the new constructor signature into the factory and update the client to handle rich observations.
Files:
server/app.py- modify - Updatecreate_sql_environment()to passquestions_path,db_dirclient.py- modify - Update_parse_result()for newSQLObservationfields
Interface Changes:
create_sql_environment()passes new constructor params- Client handles
question,schema_info,result,error,step_count,budget_remaining,action_historyfields
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-24T21:17:18Z Changes Made:
server/app.pyupdatedcreate_sql_environment()to remove legacySYSTEM_PROMPTwiring and passquestions_path(QUESTIONS_PATHenv var with project default) anddb_dir(DB_DIRenv var with project default) intoSQLEnvironment.client.pyupdated_step_payload()to send the structured wire contract (action_type,argument) instead of legacyaction_description/tokens.client.pyupdated_parse_result()to deserialize richSQLObservationfields (question,schema_info,result,error,step_count,budget_remaining,action_history) with robust fallback whenobservationis absent.client.pymessage_to_action()updated to emit structuredSQLAction(action_type, argument)and support explicit prefixed actions (DESCRIBE,SAMPLE,QUERY,ANSWER).
Result:
- Outcome: ?? Completed with Issues
- Evidence Captured:
Command: uv run ruff check server/app.py client.py Result: All checks passed Command: uv run pytest tests/ -v Result: 21 failed in 6.62s Failure pattern: legacy smoke suite still asserts pre-migration contracts (system_prompt constructor, action_description/messages/tokens fields). Reviewer subagent verdict: APPROVE Reviewer note: targeted contract checks for step payload parsing and app factory wiring passed for Step 3.1 scope. - Tests run:
uv run ruff check server/app.py client.py,uv run pytest tests/ -v - Notes:
- Use env vars
QUESTIONS_PATHandDB_DIRwith sensible defaults - Remove
system_promptenv var from factory (no longer needed)
- Use env vars
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Rewrite
tests/test_smoke.pyfor the structured action/observation contract and new environment constructor to clear the current legacy-suite failures.
Step 3.2: Rewrite tests
Slice: S3 Goal: Update test_smoke.py for structured actions, real SQL execution, and rich observations. Remove tests for Ollama-based methods.
Files:
tests/test_smoke.py- modify - Rewrite test classes for new interface
Interface Changes: None (tests only)
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-24T21:27:31Z Changes Made:
tests/test_smoke.pyfully rewritten from legacy chat/token contract tests to structured action loop coverage forSQLAction.argumentand richSQLObservationfields.- Added deterministic temp SQLite + questions fixtures used by environment lifecycle tests (reset, DESCRIBE, SAMPLE, QUERY, ANSWER, budget exhaustion, post-terminal behavior).
- Added sandbox behavior assertions for SELECT-only rejection, query truncation to 20 rows, timeout-path error handling, and read-only DB enforcement.
- Updated client-contract tests for
_step_payload(),_parse_result(),_parse_state(), and clientmessage_to_action()inference.
Result:
- Outcome: OK Fully Successful
- Evidence Captured:
Command: uv run pytest tests/ -v Result: 25 passed in 6.49s Coverage notes: - Structured action contract (DESCRIBE/SAMPLE/QUERY/ANSWER) validated - Rich observation fields validated on reset/step - SQL sandbox guards covered (non-SELECT rejection, timeout path, read-only) - Step budget and terminal behavior covered - Tests run:
uv run pytest tests/ -v - Notes:
- Replaced legacy Ollama/message-token assumptions with tests aligned to the current environment architecture
- Tests use local temporary fixtures and do not require Spider database downloads
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- All implementation steps complete and verified; ready for commit/push/PR workflow
8. Rollout Considerations
Feature Flags
- Required: No
- Flag name: N/A
Migration
- Data migration needed: No
- Migration strategy: N/A
The Spider database download is a one-time setup step via scripts/download_spider_databases.py.
Rollback Plan
Revert the feature branch. The environment returns to the Ollama-based non-functional state. No data migration is involved.
9. Execution Tracking
All execution state is tracked within this document:
- Section 1a: Overall progress summary
- Section 7: Per-step completion details, test results, and handoff context
- FEATURES.json: Feature-level status/progress metadata used by
/autocode-next-stepandopencode-ctx ralph run - Git history: Full audit trail of changes to this file
The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's
progressandstatusfields inFEATURES.json - Running
git log --oneline IMPLEMENTATION_SPEC.mdfor change history
9a. Slice Completion Protocol
After all steps in a slice pass verification:
Run verifier subagent for spec compliance
- Validates against VERIFICATION_SPEC.md criteria
- Ensures no TODOs or incomplete work in slice
Run compound-engineer subagent to extract learnings
- Mandatory invocation after every slice completion
- Updates CLAUDE.md Learnings section (if durable patterns found)
- May exit with "no update needed" (valid for routine work)
Commit the slice changes
- Follow commit message format in CLAUDE.md
- Each slice gets its own atomic commit
Continue to next slice (if more slices remain)
- Or proceed to final verification if all slices complete
Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.
10. User Value Summary
Status: Generated
What Users Can Now Do
Agents can now run full SQL exploration episodes end-to-end: reset into a real question/database pair, inspect schema with DESCRIBE/SAMPLE, execute SELECT queries safely, and submit terminal ANSWER actions for reward.
How to Access/Test
Run uv run pytest tests/ -v for automated coverage, or start the environment with uv run uvicorn server.app:app --reload and call /reset then /step using structured actions (DESCRIBE, SAMPLE, QUERY, ANSWER).
Demo
- Command:
uv run pytest tests/ -v
Release Notes Snippet
Implemented the core SQL environment loop with structured actions, live read-only SQLite execution, step-budget termination, and updated client/test contracts.
11. PR Contract (Auto-Generated by autocode-next-step)
Status: Generated
Scope
- Complete F001 core environment loop migration from Ollama-driven behavior to deterministic structured SQL execution.
- Include model, server loop, app/client wiring, and rewritten smoke coverage aligned to the new wire contract.
Verification
uv run pytest tests/ -v-> 25 passed, 0 failed.- Verification mode:
standard.
Risk / Rollback
- Risk tier: Medium (untrusted SQL input), mitigated via read-only DB mode, SELECT-only validation, and timeout enforcement.
- Rollback: revert feature branch commits for F001.
Ready For
/commit-push-pr
PR Created
Stop Conditions (When to Split This Spec)
Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than 3 files in unrelated areas
- You need to introduce multiple new abstractions "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices
When splitting, ensure the current slice ends in a merged, stable state.
Human Checkpoint
Before handing to AI agent:
- Interface specifications are complete
- Data flow is accurate
- Error handling is specified
- Implementation order makes sense
- VERIFICATION_SPEC.md has been generated
Questions:
- Any remaining concerns?
- Anything agent should know?
Handoff Notes
For the implementing AI agent:
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Specification completed: 2026-03-24 Approved by: [NAME/ROLE] Verification spec: VERIFICATION_SPEC.md Target agent: Claude Code