
Verification Specification

Feature: F003
Generated from: specs/F003-VERIFICATION_INPUT.json
Generated: 2026-03-27


1. Unit Tests

EpisodeContext (Type Extension)

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_episode_context_has_gold_rows | New field exists and defaults | EpisodeContext(...) | gold_rows is [] | happy |
| test_episode_context_has_query_hashes | New field exists and defaults | EpisodeContext(...) | query_hashes is set() | happy |
| test_episode_context_has_best_progress | New field exists and defaults | EpisodeContext(...) | best_progress is 0.0 | happy |
| test_episode_context_has_cumulative_step_reward | New field exists and defaults | EpisodeContext(...) | cumulative_step_reward is 0.0 | happy |
| test_episode_context_has_cumulative_new_info_reward | New field exists and defaults | EpisodeContext(...) | cumulative_new_info_reward is 0.0 | happy |
| test_episode_context_gold_rows_accepts_tuples | Field stores tuple list | gold_rows=[(1, "a"), (2, "b")] | Stored correctly | happy |

Run: uv run pytest tests/unit/test_reward.py -v -k "EpisodeContext"
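The defaults above suggest a dataclass extension along these lines. This is a sketch only: the existing `EpisodeContext` fields live in the project and are elided here, and the field types are assumptions inferred from the table.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeContext:
    # Existing episode fields are elided; these are the F003 additions.
    gold_rows: list = field(default_factory=list)    # rows returned by the gold SQL
    query_hashes: set = field(default_factory=set)   # hashes of SQL seen this episode
    best_progress: float = 0.0                       # best binned Layer 2 score so far
    cumulative_step_reward: float = 0.0              # running clamped step-reward total
    cumulative_new_info_reward: float = 0.0          # new-info bonus total, capped at 0.10
```

Mutable defaults use `field(default_factory=...)` so separate episodes never share a list or set instance.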


_cardinality_score

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_cardinality_exact_match | Same row count | pred=[(1,),(2,)], gold=[(3,),(4,)] | 1.0 | happy |
| test_cardinality_zero_pred | Empty prediction | pred=[], gold=[(1,)] | 0.0 | edge |
| test_cardinality_zero_gold | Empty gold | pred=[(1,)], gold=[] | 0.0 | edge |
| test_cardinality_both_empty | Both empty | pred=[], gold=[] | 1.0 (0/max(0,0,1)=0, 1-0=1) | edge |
| test_cardinality_pred_larger | More pred rows | pred=[(i,) for i in range(10)], gold=[(1,)] | 0.1 (1-9/10) | boundary |
| test_cardinality_gold_larger | More gold rows | pred=[(1,)], gold=[(i,) for i in range(4)] | 0.25 (1-3/4) | boundary |
| test_cardinality_returns_float_in_range | Any input | Various | Result in [0.0, 1.0] | invariant |

Run: uv run pytest tests/unit/test_reward.py -v -k "cardinality"
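All the expected values above fall out of a single linear formula; a minimal sketch consistent with the table (the production `_cardinality_score` may differ in name and signature):

```python
def cardinality_score(pred, gold):
    # 1.0 when row counts match; decays linearly with the count gap.
    # The max(..., 1) term makes the both-empty case score 1.0
    # instead of dividing by zero.
    diff = abs(len(pred) - len(gold))
    return 1.0 - diff / max(len(pred), len(gold), 1)
```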


_value_overlap_score

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_value_overlap_identical | Same rows | pred=[(1,"a")], gold=[(1,"a")] | 1.0 | happy |
| test_value_overlap_disjoint | No shared values | pred=[(1,"x")], gold=[(2,"y")] | 0.0 | edge |
| test_value_overlap_partial | Some overlap | pred=[(1,"a"),(2,"b")], gold=[(1,"a"),(3,"c")] | Jaccard of {"1","a","2","b"} vs {"1","a","3","c"} = 2/6 ~ 0.333 | happy |
| test_value_overlap_empty_pred | No pred rows | pred=[], gold=[(1,)] | 0.0 | edge |
| test_value_overlap_empty_gold | No gold rows | pred=[(1,)], gold=[] | 0.0 | edge |
| test_value_overlap_both_empty | Both empty | pred=[], gold=[] | 0.0 (empty Jaccard) or 1.0 (convention) | edge |
| test_value_overlap_stringifies_values | Mixed types | pred=[(1, 2.5, None)], gold=[(1, 2.5, None)] | 1.0 (all stringify to same) | edge |
| test_value_overlap_returns_float_in_range | Any input | Various | Result in [0.0, 1.0] | invariant |

Run: uv run pytest tests/unit/test_reward.py -v -k "value_overlap"
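These cases match a Jaccard similarity over stringified cell values; a sketch (the both-empty convention is left open in the table, and this version returns 0.0):

```python
def value_overlap_score(pred, gold):
    # Jaccard similarity over the sets of stringified cell values.
    pred_vals = {str(v) for row in pred for v in row}
    gold_vals = {str(v) for row in gold for v in row}
    if not pred_vals or not gold_vals:
        # Convention choice: either side empty means no overlap evidence.
        return 0.0
    return len(pred_vals & gold_vals) / len(pred_vals | gold_vals)
```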


_numeric_range_score

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_numeric_range_identical | Same numbers | pred=[(10,)], gold=[(10,)] | 1.0 | happy |
| test_numeric_range_no_numerics_in_gold | Only strings in gold | pred=[("a",)], gold=[("b",)] | 1.0 (spec: returns 1.0 if no numerics in gold) | edge |
| test_numeric_range_close_values | Near match | pred=[(11,)], gold=[(10,)] | ~0.59 (1/(1+log(1+1))) | happy |
| test_numeric_range_far_values | Very different | pred=[(1000000,)], gold=[(1,)] | Near 0.0 | boundary |
| test_numeric_range_zero_distance | Exact match numerics | pred=[(0,)], gold=[(0,)] | 1.0 (1/(1+log(1+0))=1) | edge |
| test_numeric_range_negative_numbers | Negative values | pred=[(-5,)], gold=[(5,)] | Uses absolute difference \|(-5)-5\| = 10 | edge |
| test_numeric_range_mixed_types | Some numeric some not | pred=[(10,"a")], gold=[(10,"b")] | Score based only on numeric columns | edge |
| test_numeric_range_empty_pred | No pred rows | pred=[], gold=[(1,)] | Gracefully handle, likely 0.0 | edge |
| test_numeric_range_returns_float_in_range | Any input | Various | Result in [0.0, 1.0] | invariant |

Run: uv run pytest tests/unit/test_reward.py -v -k "numeric_range"
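The tabulated values fit the log-distance form 1/(1 + log(1 + d)). How multiple numeric values get paired is not pinned down by the table; the sketch below compares the means of the numeric values, which reproduces the single-value rows but is otherwise an assumption:

```python
import math

def numeric_range_score(pred, gold):
    def nums(rows):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return [v for row in rows for v in row
                if isinstance(v, (int, float)) and not isinstance(v, bool)]
    gold_nums = nums(gold)
    if not gold_nums:
        return 1.0          # spec: no numerics in gold -> vacuously perfect
    pred_nums = nums(pred)
    if not pred_nums:
        return 0.0          # nothing numeric to compare against
    # Log-scaled absolute distance between the means of the numeric values.
    dist = abs(sum(pred_nums) / len(pred_nums) - sum(gold_nums) / len(gold_nums))
    return 1.0 / (1.0 + math.log(1.0 + dist))
```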


_bin_progress

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_bin_progress_zero | Score 0.0 | 0.0 | 0.0 (below 0.125) | boundary |
| test_bin_progress_low | Score 0.124 | 0.124 | 0.0 | boundary |
| test_bin_progress_boundary_0125 | Score exactly 0.125 | 0.125 | 0.25 | boundary |
| test_bin_progress_mid_low | Score 0.3 | 0.3 | 0.25 (between 0.125 and 0.375) | happy |
| test_bin_progress_boundary_0375 | Score exactly 0.375 | 0.375 | 0.5 | boundary |
| test_bin_progress_mid | Score 0.5 | 0.5 | 0.5 (between 0.375 and 0.625) | happy |
| test_bin_progress_boundary_0625 | Score exactly 0.625 | 0.625 | 0.75 | boundary |
| test_bin_progress_mid_high | Score 0.7 | 0.7 | 0.75 | happy |
| test_bin_progress_boundary_0875 | Score exactly 0.875 | 0.875 | 1.0 | boundary |
| test_bin_progress_one | Score 1.0 | 1.0 | 1.0 | boundary |

Run: uv run pytest tests/unit/test_reward.py -v -k "bin_progress"
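The thresholds in the table sit at the midpoints between neighboring bins, so the whole function reduces to a small ordered lookup; a sketch:

```python
def bin_progress(score):
    # Quantize a raw [0, 1] progress score into {0.0, 0.25, 0.5, 0.75, 1.0}.
    # Each threshold is the midpoint between two adjacent bin values.
    for threshold, binned in ((0.875, 1.0), (0.625, 0.75),
                              (0.375, 0.5), (0.125, 0.25)):
        if score >= threshold:
            return binned
    return 0.0
```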


_layer1_operational

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_layer1_successful_query | exec_ok + step_cost | action_type="QUERY", rows=[(1,)], error=None, new sql | +0.02 - 0.005 = +0.015 (plus possible new_info) | happy |
| test_layer1_successful_describe | exec_ok + step_cost | action_type="DESCRIBE", rows=..., error=None | +0.02 - 0.005 = +0.015 | happy |
| test_layer1_successful_sample | exec_ok + step_cost | action_type="SAMPLE", rows=..., error=None | +0.02 - 0.005 = +0.015 | happy |
| test_layer1_error_query | step_cost only | error="some error", rows=None | -0.005 | error |
| test_layer1_new_info_reward | First unique SQL | new sql hash, rows not None | Includes +0.01 new_info | happy |
| test_layer1_new_info_capped | Cap at 0.10 | Execute 11+ unique queries | cumulative_new_info_reward does not exceed 0.10 | boundary |
| test_layer1_repeat_penalty | Same SQL twice | Submit same SQL hash twice | Second call includes -0.01 repeat | error |
| test_layer1_repeat_no_exec_ok | Repeated query skips exec_ok | Same SQL hash as before | No +0.02 bonus | edge |
| test_layer1_step_cost_always_applied | Step cost on every call | Any action | Always includes -0.005 | invariant |

Run: uv run pytest tests/unit/test_reward.py -v -k "layer1"


_layer2_progress

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_layer2_perfect_match | All sub-metrics = 1.0 | rows == gold_rows (exact match) | Binned 1.0, improvement from 0 = 1.0, scaled by 0.15 = 0.15 | happy |
| test_layer2_no_improvement | Same binned score as best | Second identical query | 0.0 (no improvement over best_progress) | edge |
| test_layer2_improvement_only | New bin > best | First query close, second closer | Reward = (new_bin - best_progress) * 0.15 | happy |
| test_layer2_empty_gold_rows | Gold is empty | ctx.gold_rows = [] | 0.0 | edge |
| test_layer2_weighted_average | Check weight formula | Known sub-metric values | 0.25*card + 0.50*overlap + 0.25*numeric | happy |
| test_layer2_updates_best_progress | Mutates ctx | Query improves progress | ctx.best_progress updated to new bin | happy |
| test_layer2_does_not_downgrade_best | Worse query after good | Good query then bad query | ctx.best_progress stays at higher value | edge |

Run: uv run pytest tests/unit/test_reward.py -v -k "layer2"


compute_step_reward

| Test | Description | Input | Expected | Category |
| --- | --- | --- | --- | --- |
| test_compute_reward_query_success | Layer 1 + Layer 2 combined | QUERY with valid rows, gold_rows set | Sum of L1 + L2, clamped | happy |
| test_compute_reward_query_error | Layer 1 only, no Layer 2 | QUERY with error | -0.005 (step_cost only) | error |
| test_compute_reward_describe | Layer 1 only, no Layer 2 | DESCRIBE action | L1 signal only | happy |
| test_compute_reward_sample | Layer 1 only, no Layer 2 | SAMPLE action | L1 signal only | happy |
| test_compute_reward_clamp_upper | Cumulative capped at +0.5 | Many successful improving queries | Cumulative never exceeds +0.5 | boundary |
| test_compute_reward_clamp_lower | Cumulative floored at -0.2 | Many errors in a row | Cumulative never goes below -0.2 | boundary |
| test_compute_reward_clamp_returns_delta | Step reward reflects clamp | Cumulative at 0.49, next step would add 0.05 | Returns 0.01 (clamped to 0.5) | boundary |
| test_compute_reward_mutates_ctx | Updates tracking fields | Any call | ctx.cumulative_step_reward updated | happy |
| test_compute_reward_layer2_skipped_for_describe | No progress calc for non-QUERY | DESCRIBE with rows | Layer 2 not called | happy |
| test_compute_reward_layer2_skipped_when_rows_none | No progress calc on error | QUERY, rows=None | Layer 2 not called | edge |
| test_compute_reward_layer2_skipped_empty_gold | No progress with empty gold | QUERY, gold_rows=[] | Layer 2 returns 0.0 | edge |

Run: uv run pytest tests/unit/test_reward.py -v -k "compute_step_reward"


2. Integration Tests

Flow: Primary Reward Computation Through step()

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | env.reset(seed=42) | Episode created, gold_rows populated from gold SQL | ctx.gold_rows is non-empty list of tuples |
| 2 | env.step(DESCRIBE employees) | Step reward from Layer 1 only | observation.reward is None (non-terminal), but internal reward tracked |
| 3 | env.step(QUERY "SELECT COUNT(*) FROM employees") | Layer 1 + Layer 2 computed | Progress score reflects cardinality/value/numeric comparison to gold |
| 4 | env.step(QUERY same_sql_again) | Repeat penalty applied | Lower reward than step 3 |
| 5 | env.step(ANSWER correct_value) | Terminal reward = 1.0 | observation.done=True, observation.reward=1.0 |

Run: uv run pytest tests/integration/test_reward_flow.py -v


Flow: SQL Error Handling

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | env.reset(seed=42) | Episode active | Episode context initialized |
| 2 | env.step(QUERY "SELECT nonexistent FROM employees") | Error caught, step_cost only | Reward is -0.005, Layer 2 not computed |
| 3 | env.step(QUERY valid_query) | Normal reward resumes | Layer 1 + Layer 2 computed normally |

Run: uv run pytest tests/integration/test_reward_flow.py -v -k "error"


Flow: Empty Gold Rows

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | Reset with question whose gold SQL returns empty | ctx.gold_rows == [] | gold_rows stored as empty list |
| 2 | env.step(QUERY any_query) | Layer 1 operates, Layer 2 returns 0.0 | Reward is Layer 1 signal only |

Run: uv run pytest tests/integration/test_reward_flow.py -v -k "empty_gold"


Flow: Repeated Query Detection

| Step | Action | Expected | Verification |
| --- | --- | --- | --- |
| 1 | env.reset(seed=42) | Fresh episode | ctx.query_hashes is empty |
| 2 | env.step(QUERY "SELECT 1") | Hash added, no repeat penalty | ctx.query_hashes has 1 entry |
| 3 | env.step(QUERY "SELECT 1") | Same hash detected, repeat penalty | Reward includes -0.01, no exec_ok |
| 4 | env.step(QUERY "SELECT 2") | New hash, no repeat penalty | Normal reward, ctx.query_hashes has 2 entries |

Run: uv run pytest tests/integration/test_reward_flow.py -v -k "repeat"


3. API Tests

No API endpoints defined for F003. The reward system is internal server-side logic.


4. E2E Tests

Scenario: Random Exploration Yields ~0.1 Cumulative Reward

Setup: Environment reset with a known question. Actions: Execute 10 random DESCRIBE/SAMPLE/QUERY actions (no targeted queries). Expected: Cumulative step reward is approximately 0.1 (within [0.0, 0.2]).

Run: uv run pytest tests/e2e/test_reward_scenarios.py -v -k "random_exploration"


Scenario: Targeted Queries Yield ~0.3 Cumulative Reward

Setup: Environment reset with a known question. Actions: Execute targeted queries that progressively approach the gold answer. Expected: Cumulative step reward is approximately 0.3 (within [0.2, 0.5]).

Run: uv run pytest tests/e2e/test_reward_scenarios.py -v -k "targeted_queries"


Scenario: Correct Answer Yields ~1.3 Total Reward

Setup: Environment reset with a known question. Actions: Execute targeted queries, then ANSWER correctly. Expected: Total reward (cumulative step + terminal 1.0) is approximately 1.3 (within [1.0, 1.5]).

Run: uv run pytest tests/e2e/test_reward_scenarios.py -v -k "correct_answer"


5. Edge Cases Checklist

  • Null/None rows passed to compute_step_reward (SQL error case)
  • Empty result rows from a valid query (e.g., SELECT * FROM t WHERE 1=0)
  • Single-row gold vs multi-row prediction
  • Multi-row gold vs single-row prediction
  • Gold rows with only non-numeric values (numeric_range returns 1.0)
  • Gold rows with mixed numeric and string columns
  • Very large numeric values (boundary for log-distance formula)
  • Negative numeric values in gold or prediction
  • Float vs integer comparison in numeric range (e.g., 10 vs 10.0)
  • None/NULL values in result tuples (stringification for value_overlap)
  • SQL strings that differ only by whitespace (hash should differ or normalize)
  • Cumulative new_info exactly at cap (0.10) -- next unique query gets 0
  • Cumulative step reward exactly at clamp boundary (-0.2 or +0.5)
  • Layer 2 called with pred_rows and gold_rows of different column counts
  • _bin_progress with values outside [0, 1] (e.g., negative or > 1.0 from rounding)
  • Concurrent episodes (if supported) -- each has independent tracking fields

6. Evidence Requirements

| Category | Evidence Type | Example |
| --- | --- | --- |
| Unit tests | pytest output | uv run pytest tests/unit/test_reward.py -v shows X passed |
| Integration | pytest output | uv run pytest tests/integration/test_reward_flow.py -v shows X passed |
| E2E | pytest output | uv run pytest tests/e2e/test_reward_scenarios.py -v shows X passed |
| Reward calibration | Logged values | Random exploration ~0.1, targeted ~0.3, correct ~1.3 |
| Existing tests | pytest output | uv run pytest tests/test_smoke.py -v still passes (no regressions) |