
Verification Specification

Feature: F003
Generated from: specs/F003-VERIFICATION_INPUT.json
Generated: 2026-03-27


1. Unit Tests

EpisodeContext (Type Extension)

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_episode_context_has_gold_rows | New field exists and defaults | EpisodeContext(...) | gold_rows is [] | happy |
| test_episode_context_has_query_hashes | New field exists and defaults | EpisodeContext(...) | query_hashes is set() | happy |
| test_episode_context_has_best_progress | New field exists and defaults | EpisodeContext(...) | best_progress is 0.0 | happy |
| test_episode_context_has_cumulative_step_reward | New field exists and defaults | EpisodeContext(...) | cumulative_step_reward is 0.0 | happy |
| test_episode_context_has_cumulative_new_info_reward | New field exists and defaults | EpisodeContext(...) | cumulative_new_info_reward is 0.0 | happy |
| test_episode_context_gold_rows_accepts_tuples | Field stores tuple list | gold_rows=[(1, "a"), (2, "b")] | Stored correctly | happy |

Run: uv run pytest tests/unit/test_reward.py -v -k "EpisodeContext"
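The defaults above suggest a dataclass extension along these lines. This is a sketch only: the existing `EpisodeContext` fields live in the project and are elided here, and the field types are assumptions inferred from the table.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeContext:
    # Existing episode fields are elided; these are the F003 additions.
    gold_rows: list = field(default_factory=list)    # rows returned by the gold SQL
    query_hashes: set = field(default_factory=set)   # hashes of SQL seen this episode
    best_progress: float = 0.0                       # best binned Layer 2 score so far
    cumulative_step_reward: float = 0.0              # running clamped step-reward total
    cumulative_new_info_reward: float = 0.0          # new-info bonus total, capped at 0.10
```

Mutable defaults use `field(default_factory=...)` so separate episodes never share a list or set instance.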


_cardinality_score

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_cardinality_exact_match | Same row count | pred=[(1,),(2,)], gold=[(3,),(4,)] | 1.0 | happy |
| test_cardinality_zero_pred | Empty prediction | pred=[], gold=[(1,)] | 0.0 | edge |
| test_cardinality_zero_gold | Empty gold | pred=[(1,)], gold=[] | 0.0 | edge |
| test_cardinality_both_empty | Both empty | pred=[], gold=[] | 1.0 (0/max(0,0,1)=0, 1-0=1) | edge |
| test_cardinality_pred_larger | More pred rows | pred=[(i,) for i in range(10)], gold=[(1,)] | 0.1 (1-9/10) | boundary |
| test_cardinality_gold_larger | More gold rows | pred=[(1,)], gold=[(i,) for i in range(4)] | 0.25 (1-3/4) | boundary |
| test_cardinality_returns_float_in_range | Any input | Various | Result in [0.0, 1.0] | invariant |

Run: uv run pytest tests/unit/test_reward.py -v -k "cardinality"
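All the expected values above fall out of a single linear formula; a minimal sketch consistent with the table (the production `_cardinality_score` may differ in name and signature):

```python
def cardinality_score(pred, gold):
    # 1.0 when row counts match; decays linearly with the count gap.
    # The max(..., 1) term makes the both-empty case score 1.0
    # instead of dividing by zero.
    diff = abs(len(pred) - len(gold))
    return 1.0 - diff / max(len(pred), len(gold), 1)
```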


_value_overlap_score

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_value_overlap_identical | Same rows | pred=[(1,"a")], gold=[(1,"a")] | 1.0 | happy |
| test_value_overlap_disjoint | No shared values | pred=[(1,"x")], gold=[(2,"y")] | 0.0 | edge |
| test_value_overlap_partial | Some overlap | pred=[(1,"a"),(2,"b")], gold=[(1,"a"),(3,"c")] | Jaccard of {"1","a","2","b"} vs {"1","a","3","c"} = 2/6 ~ 0.333 | happy |
| test_value_overlap_empty_pred | No pred rows | pred=[], gold=[(1,)] | 0.0 | edge |
| test_value_overlap_empty_gold | No gold rows | pred=[(1,)], gold=[] | 0.0 | edge |
| test_value_overlap_both_empty | Both empty | pred=[], gold=[] | 0.0 (empty Jaccard) or 1.0 (convention) | edge |
| test_value_overlap_stringifies_values | Mixed types | pred=[(1, 2.5, None)], gold=[(1, 2.5, None)] | 1.0 (all stringify to same) | edge |
| test_value_overlap_returns_float_in_range | Any input | Various | Result in [0.0, 1.0] | invariant |

Run: uv run pytest tests/unit/test_reward.py -v -k "value_overlap"
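These cases match a Jaccard similarity over stringified cell values; a sketch (the both-empty convention is left open in the table, and this version returns 0.0):

```python
def value_overlap_score(pred, gold):
    # Jaccard similarity over the sets of stringified cell values.
    pred_vals = {str(v) for row in pred for v in row}
    gold_vals = {str(v) for row in gold for v in row}
    if not pred_vals or not gold_vals:
        # Convention choice: either side empty means no overlap evidence.
        return 0.0
    return len(pred_vals & gold_vals) / len(pred_vals | gold_vals)
```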


_numeric_range_score

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_numeric_range_identical | Same numbers | pred=[(10,)], gold=[(10,)] | 1.0 | happy |
| test_numeric_range_no_numerics_in_gold | Only strings in gold | pred=[("a",)], gold=[("b",)] | 1.0 (spec: returns 1.0 if no numerics in gold) | edge |
| test_numeric_range_close_values | Near match | pred=[(11,)], gold=[(10,)] | ~0.59 (1/(1+log(1+1))) | happy |
| test_numeric_range_far_values | Very different | pred=[(1000000,)], gold=[(1,)] | Near 0.0 | boundary |
| test_numeric_range_zero_distance | Exact match numerics | pred=[(0,)], gold=[(0,)] | 1.0 (1/(1+log(1+0))=1) | edge |
| test_numeric_range_negative_numbers | Negative values | pred=[(-5,)], gold=[(5,)] | Uses absolute difference \|(-5)-5\| = 10 | edge |
| test_numeric_range_mixed_types | Some numeric some not | pred=[(10,"a")], gold=[(10,"b")] | Score based only on numeric columns | edge |
| test_numeric_range_empty_pred | No pred rows | pred=[], gold=[(1,)] | Gracefully handle, likely 0.0 | edge |
| test_numeric_range_returns_float_in_range | Any input | Various | Result in [0.0, 1.0] | invariant |

Run: uv run pytest tests/unit/test_reward.py -v -k "numeric_range"
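The tabulated values fit the log-distance form 1/(1 + log(1 + d)). How multiple numeric values get paired is not pinned down by the table; the sketch below compares the means of the numeric values, which reproduces the single-value rows but is otherwise an assumption:

```python
import math

def numeric_range_score(pred, gold):
    def nums(rows):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return [v for row in rows for v in row
                if isinstance(v, (int, float)) and not isinstance(v, bool)]
    gold_nums = nums(gold)
    if not gold_nums:
        return 1.0          # spec: no numerics in gold -> vacuously perfect
    pred_nums = nums(pred)
    if not pred_nums:
        return 0.0          # nothing numeric to compare against
    # Log-scaled absolute distance between the means of the numeric values.
    dist = abs(sum(pred_nums) / len(pred_nums) - sum(gold_nums) / len(gold_nums))
    return 1.0 / (1.0 + math.log(1.0 + dist))
```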


_bin_progress

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_bin_progress_zero | Score 0.0 | 0.0 | 0.0 (below 0.125) | boundary |
| test_bin_progress_low | Score 0.124 | 0.124 | 0.0 | boundary |
| test_bin_progress_boundary_0125 | Score exactly 0.125 | 0.125 | 0.25 | boundary |
| test_bin_progress_mid_low | Score 0.3 | 0.3 | 0.25 (between 0.125 and 0.375) | happy |
| test_bin_progress_boundary_0375 | Score exactly 0.375 | 0.375 | 0.5 | boundary |
| test_bin_progress_mid | Score 0.5 | 0.5 | 0.5 (between 0.375 and 0.625) | happy |
| test_bin_progress_boundary_0625 | Score exactly 0.625 | 0.625 | 0.75 | boundary |
| test_bin_progress_mid_high | Score 0.7 | 0.7 | 0.75 | happy |
| test_bin_progress_boundary_0875 | Score exactly 0.875 | 0.875 | 1.0 | boundary |
| test_bin_progress_one | Score 1.0 | 1.0 | 1.0 | boundary |

Run: uv run pytest tests/unit/test_reward.py -v -k "bin_progress"
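The thresholds in the table sit at the midpoints between neighboring bins, so the whole function reduces to a small ordered lookup; a sketch:

```python
def bin_progress(score):
    # Quantize a raw [0, 1] progress score into {0.0, 0.25, 0.5, 0.75, 1.0}.
    # Each threshold is the midpoint between two adjacent bin values.
    for threshold, binned in ((0.875, 1.0), (0.625, 0.75),
                              (0.375, 0.5), (0.125, 0.25)):
        if score >= threshold:
            return binned
    return 0.0
```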


_layer1_operational

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_layer1_successful_query | exec_ok + step_cost | action_type="QUERY", rows=[(1,)], error=None, new sql | +0.02 - 0.005 = +0.015 (plus possible new_info) | happy |
| test_layer1_successful_describe | exec_ok + step_cost | action_type="DESCRIBE", rows=..., error=None | +0.02 - 0.005 = +0.015 | happy |
| test_layer1_successful_sample | exec_ok + step_cost | action_type="SAMPLE", rows=..., error=None | +0.02 - 0.005 = +0.015 | happy |
| test_layer1_error_query | step_cost only | error="some error", rows=None | -0.005 | error |
| test_layer1_new_info_reward | First unique SQL | new sql hash, rows not None | Includes +0.01 new_info | happy |
| test_layer1_new_info_capped | Cap at 0.10 | Execute 11+ unique queries | cumulative_new_info_reward does not exceed 0.10 | boundary |
| test_layer1_repeat_penalty | Same SQL twice | Submit same SQL hash twice | Second call includes -0.01 repeat | error |
| test_layer1_repeat_no_exec_ok | Repeated query skips exec_ok | Same SQL hash as before | No +0.02 bonus | edge |
| test_layer1_step_cost_always_applied | Step cost on every call | Any action | Always includes -0.005 | invariant |

Run: uv run pytest tests/unit/test_reward.py -v -k "layer1"


_layer2_progress

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_layer2_perfect_match | All sub-metrics = 1.0 | rows == gold_rows (exact match) | Binned 1.0, improvement from 0 = 1.0, scaled by 0.15 = 0.15 | happy |
| test_layer2_no_improvement | Same binned score as best | Second identical query | 0.0 (no improvement over best_progress) | edge |
| test_layer2_improvement_only | New bin > best | First query close, second closer | Reward = (new_bin - best_progress) * 0.15 | happy |
| test_layer2_empty_gold_rows | Gold is empty | ctx.gold_rows = [] | 0.0 | edge |
| test_layer2_weighted_average | Check weight formula | Known sub-metric values | 0.25*card + 0.50*overlap + 0.25*numeric | happy |
| test_layer2_updates_best_progress | Mutates ctx | Query improves progress | ctx.best_progress updated to new bin | happy |
| test_layer2_does_not_downgrade_best | Worse query after good | Good query then bad query | ctx.best_progress stays at higher value | edge |

Run: uv run pytest tests/unit/test_reward.py -v -k "layer2"


compute_step_reward

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_compute_reward_query_success | Layer 1 + Layer 2 combined | QUERY with valid rows, gold_rows set | Sum of L1 + L2, clamped | happy |
| test_compute_reward_query_error | Layer 1 only, no Layer 2 | QUERY with error | -0.005 (step_cost only) | error |
| test_compute_reward_describe | Layer 1 only, no Layer 2 | DESCRIBE action | L1 signal only | happy |
| test_compute_reward_sample | Layer 1 only, no Layer 2 | SAMPLE action | L1 signal only | happy |
| test_compute_reward_clamp_upper | Cumulative capped at +0.5 | Many successful improving queries | Cumulative never exceeds +0.5 | boundary |
| test_compute_reward_clamp_lower | Cumulative floored at -0.2 | Many errors in a row | Cumulative never goes below -0.2 | boundary |
| test_compute_reward_clamp_returns_delta | Step reward reflects clamp | Cumulative at 0.49, next step would add 0.05 | Returns 0.01 (clamped to 0.5) | boundary |
| test_compute_reward_mutates_ctx | Updates tracking fields | Any call | ctx.cumulative_step_reward updated | happy |
| test_compute_reward_layer2_skipped_for_describe | No progress calc for non-QUERY | DESCRIBE with rows | Layer 2 not called | happy |
| test_compute_reward_layer2_skipped_when_rows_none | No progress calc on error | QUERY, rows=None | Layer 2 not called | edge |
| test_compute_reward_layer2_skipped_empty_gold | No progress with empty gold | QUERY, gold_rows=[] | Layer 2 returns 0.0 | edge |

Run: uv run pytest tests/unit/test_reward.py -v -k "compute_step_reward"


2. Integration Tests

Flow: Primary Reward Computation Through step()

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | env.reset(seed=42) | Episode created, gold_rows populated from gold SQL | ctx.gold_rows is non-empty list of tuples |
| 2 | env.step(DESCRIBE employees) | Step reward from Layer 1 only | observation.reward is None (non-terminal), but internal reward tracked |
| 3 | env.step(QUERY "SELECT COUNT(*) FROM employees") | Layer 1 + Layer 2 computed | Progress score reflects cardinality/value/numeric comparison to gold |
| 4 | env.step(QUERY same_sql_again) | Repeat penalty applied | Lower reward than step 3 |
| 5 | env.step(ANSWER correct_value) | Terminal reward = 1.0 | observation.done=True, observation.reward=1.0 |

Run: uv run pytest tests/integration/test_reward_flow.py -v


Flow: SQL Error Handling

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | env.reset(seed=42) | Episode active | Episode context initialized |
| 2 | env.step(QUERY "SELECT nonexistent FROM employees") | Error caught, step_cost only | Reward is -0.005, Layer 2 not computed |
| 3 | env.step(QUERY valid_query) | Normal reward resumes | Layer 1 + Layer 2 computed normally |

Run: uv run pytest tests/integration/test_reward_flow.py -v -k "error"


Flow: Empty Gold Rows

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset with question whose gold SQL returns empty | ctx.gold_rows == [] | gold_rows stored as empty list |
| 2 | env.step(QUERY any_query) | Layer 1 operates, Layer 2 returns 0.0 | Reward is Layer 1 signal only |

Run: uv run pytest tests/integration/test_reward_flow.py -v -k "empty_gold"


Flow: Repeated Query Detection

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | env.reset(seed=42) | Fresh episode | ctx.query_hashes is empty |
| 2 | env.step(QUERY "SELECT 1") | Hash added, no repeat penalty | ctx.query_hashes has 1 entry |
| 3 | env.step(QUERY "SELECT 1") | Same hash detected, repeat penalty | Reward includes -0.01, no exec_ok |
| 4 | env.step(QUERY "SELECT 2") | New hash, no repeat penalty | Normal reward, ctx.query_hashes has 2 entries |

Run: uv run pytest tests/integration/test_reward_flow.py -v -k "repeat"


3. API Tests

No API endpoints defined for F003. The reward system is internal server-side logic.


4. E2E Tests

Scenario: Random Exploration Yields ~0.1 Cumulative Reward

Setup: Environment reset with a known question. Actions: Execute 10 random DESCRIBE/SAMPLE/QUERY actions (no targeted queries). Expected: Cumulative step reward is approximately 0.1 (within [0.0, 0.2]).

Run: uv run pytest tests/e2e/test_reward_scenarios.py -v -k "random_exploration"


Scenario: Targeted Queries Yield ~0.3 Cumulative Reward

Setup: Environment reset with a known question. Actions: Execute targeted queries that progressively approach the gold answer. Expected: Cumulative step reward is approximately 0.3 (within [0.2, 0.5]).

Run: uv run pytest tests/e2e/test_reward_scenarios.py -v -k "targeted_queries"


Scenario: Correct Answer Yields ~1.3 Total Reward

Setup: Environment reset with a known question. Actions: Execute targeted queries, then ANSWER correctly. Expected: Total reward (cumulative step + terminal 1.0) is approximately 1.3 (within [1.0, 1.5]).

Run: uv run pytest tests/e2e/test_reward_scenarios.py -v -k "correct_answer"


5. Edge Cases Checklist

  • Null/None rows passed to compute_step_reward (SQL error case)
  • Empty result rows from a valid query (e.g., SELECT * FROM t WHERE 1=0)
  • Single-row gold vs multi-row prediction
  • Multi-row gold vs single-row prediction
  • Gold rows with only non-numeric values (numeric_range returns 1.0)
  • Gold rows with mixed numeric and string columns
  • Very large numeric values (boundary for log-distance formula)
  • Negative numeric values in gold or prediction
  • Float vs integer comparison in numeric range (e.g., 10 vs 10.0)
  • None/NULL values in result tuples (stringification for value_overlap)
  • SQL strings that differ only by whitespace (hash should differ or normalize)
  • Cumulative new_info exactly at cap (0.10) -- next unique query gets 0
  • Cumulative step reward exactly at clamp boundary (-0.2 or +0.5)
  • Layer 2 called with pred_rows and gold_rows of different column counts
  • _bin_progress with values outside [0, 1] (e.g., negative or > 1.0 from rounding)
  • Concurrent episodes (if supported) -- each has independent tracking fields

6. Evidence Requirements

| Category | Evidence Type | Example |
| --- | --- | --- |
| Unit tests | pytest output | uv run pytest tests/unit/test_reward.py -v shows X passed |
| Integration | pytest output | uv run pytest tests/integration/test_reward_flow.py -v shows X passed |
| E2E | pytest output | uv run pytest tests/e2e/test_reward_scenarios.py -v shows X passed |
| Reward calibration | Logged values | Random exploration ~0.1, targeted ~0.3, correct ~1.3 |
| Existing tests | pytest output | uv run pytest tests/test_smoke.py -v still passes (no regressions) |