
Distance-to-Goal Metrics: Complete Reference Table

Priority-Ordered Metric Comparison

| Priority | Metric Name | Effort | Value | Concept | What It Measures | Key Value | Potential Risks |
|---|---|---|---|---|---|---|---|
| 1 | Cardinality Matching | 30 min | Very High | Compare row/value counts | "Did you return the right number of things?" | Catches over/under-retrieval early; works universally | None significant; very robust |
| 2 | Value Overlap (Set-based) | 45 min | Very High | Flatten results to value sets, compute Jaccard | "How many correct values did you find?" | Rewards partial correctness; format-agnostic | May over-reward coincidental matches in large results |
| 3 | Numeric Range Proximity | 1 hour | High | Logarithmic distance for numbers | "Are your numbers in the right ballpark?" | Critical for COUNT/SUM/AVG queries; rewards order-of-magnitude accuracy | Only useful for numeric questions (~40% of dataset) |
| 4 | Row-wise Best Match | 1.5 hours | High | Find best row pairing between results | "How many rows are correct, ignoring order?" | Handles ORDER BY issues; forgiving of column mismatch | Computationally expensive for large result sets (100+ rows) |
| 5 | Schema Coverage | 2 hours | Medium-High | Compare tables used in queries | "Are you querying the right tables?" | Guides exploration toward relevant parts of schema | Requires SQL parsing; may reward irrelevant table access |
| 6 | Column Alignment | 2.5 hours | Medium | Fuzzy match column names | "Do your result columns make semantic sense?" | Helps disambiguate multi-column results | Requires metadata tracking; fuzzy matching can be noisy |
| 7 | Rank Correlation | 3 hours | Medium | Spearman correlation on ordered results | "Is your ranking correct for TOP-K queries?" | Specific to ordered results; robust to ties | Only applicable to ~20% of questions; needs scipy dependency |
| 8 | SQL Structure | 6+ hours | Low-Medium | Parse query AST, compare structure | "Is your query syntactically similar?" | Might help with complex multi-join queries | HIGH RISK: reward hacking; overfitting; ignores semantic equivalence |
| 9 | Execution Plan | 8+ hours | Low | Compare database query plans | "Does your query execute similarly?" | Theoretical value for optimization | NOT RECOMMENDED: too complex; DB-specific; doesn't guarantee correctness |

Detailed Conceptual Explanations

🥇 Tier 1: Must-Have Metrics

1. Cardinality Matching

How It Works:

Input:  agent_result = [row1, row2, row3, row4, row5]  (5 rows)
        gold_result  = [row1, row2, row3]              (3 rows)

Calculation:
  difference = |5 - 3| = 2
  relative_error = 2 / 3 = 0.667
  reward = 1 - min(1.0, 0.667) = 0.333

Conceptual Meaning: "Getting the right number of things is the first step to getting the right things."

Value Provided:

  • Early signal: Before content is right, size can be right
  • Universal: Works for integers (1 value), lists (N values), tables (M rows)
  • Catches common errors:
    • Missing GROUP BY → too few rows
    • Cartesian join → too many rows
    • Wrong aggregation → count off by orders of magnitude

Risk Analysis: ✅ Very low risk

  • No reward hacking opportunities (agent can't game cardinality without semantic progress)
  • Monotonic with correctness (better results = better cardinality)
  • Fast to compute (O(1))
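
A minimal sketch of this metric, assuming results arrive as lists of row tuples (the function name cardinality_reward matches the combination examples later in this document):

def cardinality_reward(agent_result, gold_result):
    """Score how close the agent's row count is to the gold row count."""
    agent_rows = len(agent_result)
    gold_rows = len(gold_result)
    if gold_rows == 0:
        # Empty gold result: full credit only if the agent also returned nothing
        return 1.0 if agent_rows == 0 else 0.0
    relative_error = abs(agent_rows - gold_rows) / gold_rows
    return 1.0 - min(1.0, relative_error)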

2. Value Overlap (Set-based)

How It Works:

Question: "Which departments have >50 employees?"
Gold:  {Engineering, Sales, Marketing}
Agent: {Engineering, Sales, HR, Legal}  (2 correct, 2 wrong)

Calculation:
  intersection = {Engineering, Sales} = 2 values
  union = {Engineering, Sales, Marketing, HR, Legal} = 5 values
  Jaccard = 2/5 = 0.4

Conceptual Meaning: "You found some of the right answers; find the rest."

Value Provided:

  • Partial credit: Agent gets 0.4 reward even though final answer isn't perfect
  • Format agnostic: Flattens everything to atomic values
    • [("Engineering", 42)] โ†’ {Engineering, 42}
    • [(42, "Engineering")] โ†’ {Engineering, 42} (same!)
  • Compositional: Finding 1 of 3 correct departments = 0.33, finding 2 of 3 = 0.67, finding all 3 = 1.0

Risk Analysis: ⚠️ Low-medium risk

  • Coincidental matches: If gold answer is {42} and agent returns entire employees table with 100 rows, might contain 42 by chance
  • Mitigation: Combine with cardinality (penalizes returning too many values)
  • False precision: {42} vs {42.0} vs {42.001}: depending on how values are normalized to strings, numerically equivalent values may fail to match, or nearly-equal values may be conflated
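
A minimal sketch, assuming rows are tuples/lists and that plain str() normalization is acceptable (see the false-precision caveat above):

def value_overlap_reward(agent_result, gold_result):
    """Jaccard similarity between the sets of atomic values in the two results."""
    def flatten(result):
        # Stringify every cell so the comparison is hashable and format-agnostic
        return {str(cell) for row in result for cell in row}

    agent_values = flatten(agent_result)
    gold_values = flatten(gold_result)
    if not agent_values and not gold_values:
        return 1.0  # Both empty counts as a match
    if not agent_values or not gold_values:
        return 0.0
    return len(agent_values & gold_values) / len(agent_values | gold_values)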

3. Numeric Range Proximity

How It Works:

Question: "Average salary in Engineering?"
Gold: 95000

Agent attempt 1: 87000 (off by 8.4%)
  relative_error = 8000 / 95000 = 0.084
  reward = 1 - log10(1 + 0.084) = 1 - log10(1.084) = 1 - 0.035 = 0.965

Agent attempt 2: 9500 (off by 90%, wrong order of magnitude)
  relative_error = 85500 / 95000 = 0.9
  reward = 1 - log10(1 + 0.9) = 1 - log10(1.9) = 1 - 0.279 = 0.721

Agent attempt 3: 950000 (10x too high)
  relative_error = 855000 / 95000 = 9.0
  reward = 1 - log10(1 + 9.0) = 1 - log10(10) = 1 - 1.0 = 0.0

Conceptual Meaning: "Being off by 10% is very different from being off by 10x."

Value Provided:

  • Order-of-magnitude thinking: Rewards agent for "ballpark correct" before exact
  • Logarithmic scale:
    • 95k → 100k (5% error) = high reward (0.98)
    • 95k → 190k (100% error) = medium reward (0.70)
    • 95k → 950k (900% error) = no reward (0.0)
  • Natural for SQL: COUNT, SUM, AVG queries often close but not exact on first try

Risk Analysis: ⚠️ Medium risk

  • Only useful for ~40% of questions: Text/categorical answers get no benefit
  • Multiple numbers: If result has [42, 100, 5] and gold is [42], which number to compare?
    • Solution: Use closest match for each gold number
  • Zero-handling: gold=0, agent=1 is infinitely far; needs special case
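
A minimal sketch that uses the closest-match strategy and a zero-handling special case as described above (both choices are assumptions about how you want to resolve those edge cases):

import math

def numeric_range_reward(agent_result, gold_result):
    """Log-scaled proximity between numeric values in the agent and gold results."""
    def numbers(result):
        return [cell for row in result for cell in row
                if isinstance(cell, (int, float)) and not isinstance(cell, bool)]

    gold_nums = numbers(gold_result)
    agent_nums = numbers(agent_result)
    if not gold_nums or not agent_nums:
        return 0.0  # Not applicable; give this metric zero weight upstream

    per_value = []
    for g in gold_nums:
        a = min(agent_nums, key=lambda x: abs(x - g))  # Closest agent number
        if g == 0:
            relative_error = abs(a)  # Zero gold value: fall back to absolute error
        else:
            relative_error = abs(a - g) / abs(g)
        per_value.append(max(0.0, 1.0 - math.log10(1.0 + relative_error)))
    return sum(per_value) / len(per_value)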

4. Row-wise Best Match

How It Works:

Question: "Top 3 departments by size"
Gold:   [(Engineering, 65), (Sales, 58), (Marketing, 52)]
Agent:  [(Marketing, 52), (Engineering, 65), (Sales, 58)]  # Wrong order!

Process:
  For each gold row, find best matching agent row:
    Gold row 1: (Engineering, 65)
      vs Agent row 1: (Marketing, 52)     โ†’ 0/2 match = 0.0
      vs Agent row 2: (Engineering, 65)   โ†’ 2/2 match = 1.0 โœ“
      vs Agent row 3: (Sales, 58)         โ†’ 0/2 match = 0.0
      Best match: 1.0
    
    Gold row 2: (Sales, 58)
      Best match: 1.0 (agent row 3)
    
    Gold row 3: (Marketing, 52)
      Best match: 1.0 (agent row 1)
  
  Final reward: (1.0 + 1.0 + 1.0) / 3 = 1.0

Conceptual Meaning: "You got the right rows, just in the wrong orderโ€”that's still mostly correct."

Value Provided:

  • Order-invariant: Catches ORDER BY mistakes without penalizing heavily
  • Extra columns forgiven: If agent returns (Engineering, 65, 95000) and gold is (Engineering, 65), first 2 columns match → 0.67 reward
  • Partial row matches: Agent got department name right but count wrong → 0.5 reward per row

Risk Analysis: ⚠️ Medium risk

  • Computationally expensive: O(M×N) comparisons for M gold rows and N agent rows
    • For 100-row results: 10,000 comparisons per reward calculation
    • Mitigation: Limit to first 20 rows
  • Ambiguous matching: If gold has duplicate rows, which agent row should match which?
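
A minimal sketch; dividing by the longer row is one way to reproduce the 0.67 "extra column" behavior above, and the max_rows cap is the mitigation from the risk analysis:

def rowwise_best_match(agent_result, gold_result, max_rows=20):
    """Average, over gold rows, of the best cell overlap with any agent row."""
    if not gold_result:
        return 1.0 if not agent_result else 0.0
    if not agent_result:
        return 0.0

    # Cap both sides to keep the O(M*N) comparison affordable
    gold_rows = gold_result[:max_rows]
    agent_rows = agent_result[:max_rows]

    def row_similarity(gold_row, agent_row):
        gold_cells = [str(c) for c in gold_row]
        agent_cells = [str(c) for c in agent_row]
        matches = sum(1 for c in gold_cells if c in agent_cells)
        # Divide by the longer row so extra columns cost a little (2/3 = 0.67 above)
        return matches / max(len(gold_cells), len(agent_cells))

    total = sum(max(row_similarity(g, a) for a in agent_rows) for g in gold_rows)
    return total / len(gold_rows)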

🥈 Tier 2: Nice-to-Have Metrics

5. Schema Coverage

How It Works:

Question: "How many employees in Engineering?"
Gold query:   SELECT COUNT(*) FROM employees WHERE department='Engineering'
              Tables used: {employees}

Agent query:  SELECT COUNT(*) FROM employees e 
              JOIN departments d ON e.dept_id = d.id 
              WHERE d.name='Engineering'
              Tables used: {employees, departments}

Calculation:
  intersection = {employees} = 1
  union = {employees, departments} = 2
  Jaccard = 1/2 = 0.5
  
  Penalty for extra table: 0.1 * 1 = 0.1
  Final: 0.5 - 0.1 = 0.4

Conceptual Meaning: "You're exploring the right part of the database."

Value Provided:

  • Exploration guidance: Early signal before query results are correct
  • Helps with multi-hop: "You found employees, now look at departments"
  • Penalizes shotgun approach: Agent that queries every table gets low reward

Risk Analysis: ⚠️ Medium-high risk

  • Multiple valid paths: Simple query might use 1 table, complex query uses 3; both correct
  • Irrelevant table penalty: What if agent explores departments first before finding employees? Gets penalized for valid exploration
  • Requires SQL parsing: Dependency on sqlparse library; edge cases in parsing
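
A minimal sketch; to keep it dependency-free it extracts table names with a naive regex over FROM/JOIN clauses instead of a real SQL parser, so subqueries, CTEs, and quoted identifiers will slip through:

import re

def extract_tables(sql):
    """Rough table extraction: bare identifiers that follow FROM or JOIN."""
    pattern = r'\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)'
    return {name.lower() for name in re.findall(pattern, sql, flags=re.IGNORECASE)}

def schema_coverage_reward(agent_query, gold_query, extra_table_penalty=0.1):
    """Jaccard over referenced tables, minus a small penalty per extra table."""
    agent_tables = extract_tables(agent_query)
    gold_tables = extract_tables(gold_query)
    if not gold_tables or not agent_tables:
        return 0.0
    jaccard = len(agent_tables & gold_tables) / len(agent_tables | gold_tables)
    extra_tables = len(agent_tables - gold_tables)
    return max(0.0, jaccard - extra_table_penalty * extra_tables)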

6. Column Alignment

How It Works:

Question: "List departments and their average salaries"
Gold columns:    [department_name, avg_salary]
Agent columns:   [dept, average_compensation]

Fuzzy matching:
  "department_name" vs "dept" 
    → SequenceMatcher = 0.65 (partial match) ✓
  
  "avg_salary" vs "average_compensation"
    → SequenceMatcher = 0.45 (weak match) ✗

Reward: 1/2 columns matched = 0.5

Conceptual Meaning: "Your columns have the right semantic meaning."

Value Provided:

  • Disambiguates multi-column results: If result has [42, 100, 5], which column is the answer?
  • Catches projection errors: Agent did SELECT * when it should have done SELECT department, COUNT(*)
  • Fuzzy matching helps: "dept" matches "department", "emp_id" matches "employee_id"

Risk Analysis: ⚠️ High risk

  • Requires metadata: Need to track column names from query results (not always available in raw SQLite)
  • Fuzzy matching noise: "count" matches "country" (0.7 similarity), "id" matches "bid" (0.67 similarity)
  • Aliasing issues: SELECT COUNT(*) AS total vs SELECT COUNT(*) AS num_employees; both mean the same thing
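
A minimal sketch using difflib from the standard library; the 0.6 match threshold is an assumption, and the column names must come from query metadata (e.g., cursor.description), which is exactly the tracking requirement flagged above:

from difflib import SequenceMatcher

def column_alignment_reward(agent_columns, gold_columns, threshold=0.6):
    """Fraction of gold columns with a fuzzy name match among the agent's columns."""
    if not gold_columns or not agent_columns:
        return 0.0
    matched = 0
    for gold_col in gold_columns:
        best = max(SequenceMatcher(None, gold_col.lower(), agent_col.lower()).ratio()
                   for agent_col in agent_columns)
        if best >= threshold:
            matched += 1
    return matched / len(gold_columns)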

7. Rank Correlation

How It Works:

Question: "Top 5 products by revenue"
Gold:   [ProductA: $1M, ProductB: $900K, ProductC: $800K, ProductD: $750K, ProductE: $700K]
        Ranks: [1, 2, 3, 4, 5]

Agent:  [ProductA: $1M, ProductC: $850K, ProductB: $880K, ProductE: $710K, ProductD: $740K]
        Ranks: [1, 3, 2, 5, 4]

Spearman correlation:
  Rank differences: [0, -1, +1, -1, +1]
  Sum of squared differences: 4 (n = 5)
  Correlation coefficient: 1 - (6 × 4) / (5 × (5² - 1)) = 0.8

  Reward: (0.8 + 1) / 2 = 0.9

Conceptual Meaning: "You got the relative ordering mostly right."

Value Provided:

  • Specific to TOP-K queries: ~20% of Spider questions involve ranking
  • Robust to ties: Handles "tied for 2nd place" correctly
  • Partial credit for ordering: Top 3 correct but bottom 2 swapped → still high reward

Risk Analysis: ⚠️ Medium risk

  • Limited applicability: Only works for ordered results
  • Requires scipy: Heavy dependency just for one metric
  • Rank vs. value confusion: Agent might get ranking right but values wrong (or vice versa)
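
A minimal sketch using scipy (the dependency noted above); treating the first column of each row as the ranked entity is an assumption, and items missing from either result are dropped before correlating:

from scipy.stats import spearmanr

def rank_correlation_reward(agent_result, gold_result):
    """Spearman correlation between gold and agent ordering, mapped onto [0, 1]."""
    gold_keys = [str(row[0]) for row in gold_result]
    agent_keys = [str(row[0]) for row in agent_result]

    common = [k for k in gold_keys if k in agent_keys]
    if len(common) < 2:
        return 0.0

    gold_ranks = [gold_keys.index(k) for k in common]
    agent_ranks = [agent_keys.index(k) for k in common]
    rho, _ = spearmanr(gold_ranks, agent_ranks)
    if rho != rho:  # NaN guard (e.g., constant ranks)
        return 0.0
    return (rho + 1.0) / 2.0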

🥉 Tier 3: Avoid (High Risk, Low Value)

8. SQL Structure Similarity

How It Works:

Gold query:   SELECT d.name, COUNT(*) 
              FROM employees e 
              JOIN departments d ON e.dept_id = d.id 
              GROUP BY d.name

Agent query:  SELECT department, COUNT(*) 
              FROM employees 
              GROUP BY department

Structural comparison:
  Tables: {employees} vs {employees, departments} → 0.5
  Joins: {(employees, departments)} vs {} → 0.0
  Aggregates: {COUNT} vs {COUNT} → 1.0
  Group By: {d.name} vs {department} → 0.5
  
  Weighted average: 0.5

Conceptual Meaning: "Your query looks syntactically similar to the gold query."

Risk Analysis: 🛑 VERY HIGH RISK - DO NOT IMPLEMENT

  • Reward hacking: Agent learns to copy SQL structure without understanding semantics
  • Multiple valid solutions: Gold uses JOIN, agent uses subquery; both correct, but a structure-based reward penalizes the agent anyway
  • Overfitting: Agent optimizes for "looking like gold query" instead of "getting right answer"
  • Research evidence: SQL-TRAIL paper found structure-based rewards hurt generalization

9. Execution Plan Similarity

How It Works:

Gold query execution plan:
  1. Scan departments (10 rows)
  2. Scan employees (1000 rows)  
  3. Hash join (O(N))
  4. Aggregate (O(N))

Agent query execution plan:
  1. Scan employees (1000 rows)
  2. Nested loop with departments (O(N²))
  3. Aggregate (O(N))

Similarity: 2/4 steps similar = 0.5

Conceptual Meaning: "Your query executes in a similar way."

Risk Analysis: 🛑 VERY HIGH RISK - DO NOT IMPLEMENT

  • Database-specific: SQLite plans differ from PostgreSQL plans
  • Doesn't guarantee correctness: Two queries with similar plans can have different results
  • Computationally expensive: Running EXPLAIN on every query doubles execution time
  • Complexity: Comparing tree structures is non-trivial
  • No research evidence: No prior work shows this helps RL training

🧮 Combining Multiple Metrics into Final Reward

The Challenge

You have multiple distance-to-goal metrics. How do you combine them into a single scalar reward?

Current state:
  cardinality_score = 0.8
  value_overlap_score = 0.6
  numeric_range_score = 0.9
  row_match_score = 0.7

Need: single_reward = ???

🎯 Method 1: Weighted Average (RECOMMENDED for MVP)

Formula:

def weighted_average_reward(scores, weights):
    """Simple weighted average of applicable metrics."""
    # Weight only the metrics that were actually computed for this question
    total_weight = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total_weight

Example Implementation:

def compute_progress_reward(agent_result, gold_result, agent_query=None):
    # Compute all metrics
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
        'row_match': rowwise_best_match(agent_result, gold_result),
    }
    
    # Fixed weights (tune these!)
    weights = {
        'cardinality': 0.25,
        'value_overlap': 0.40,  # Highest weight (most universal)
        'numeric_range': 0.15,
        'row_match': 0.20,
    }
    
    return weighted_average_reward(scores, weights)

Pros:

  • ✅ Simple to implement and understand
  • ✅ Easy to tune (adjust weights based on training performance)
  • ✅ Monotonic (if individual metrics improve, final reward improves)
  • ✅ Bounded [0, 1]

Cons:

  • ⚠️ Fixed weights might not be optimal for all question types
  • ⚠️ Applies the same weights regardless of question context

When to use: Default choice for MVP. Start here.


🎯 Method 2: Adaptive Weighting by Question Type

Formula:

def adaptive_weighted_reward(scores, question_metadata):
    """Adjust weights based on question characteristics."""

    # Detect question type (metadata carries the question text, gold query, and gold result)
    gold_result = question_metadata['gold_result']
    question = question_metadata['question']
    gold_query = question_metadata['gold_query']

    is_numeric = has_numeric_answer(gold_result)  # Helper: does the gold result contain numbers?
    is_multirow = len(gold_result) > 1
    is_ordered = "TOP" in question.upper() or "ORDER BY" in gold_query.upper()  # For a future rank metric

    # Adaptive weights
    weights = {
        'cardinality': 0.25,  # Always important
        'value_overlap': 0.40 if not is_numeric else 0.30,
        'numeric_range': 0.30 if is_numeric else 0.0,  # Only for numeric answers
        'row_match': 0.20 if is_multirow else 0.10,
    }

    # Normalize so the weights sum to 1
    total = sum(weights.values())
    weights = {k: v / total for k, v in weights.items()}

    return sum(scores[k] * weights[k] for k in scores)

Example:

Question: "Average salary in Engineering?" (numeric, single-value)
  → weights after normalization: cardinality≈0.26, value_overlap≈0.32, numeric_range≈0.32, row_match≈0.11

Question: "List all departments with >50 employees" (text, multi-row)
  → weights after normalization: cardinality≈0.26, value_overlap≈0.42, numeric_range=0.0, row_match≈0.32

Pros:

  • ✅ More accurate reward signal for different question types
  • ✅ Automatically disables irrelevant metrics (e.g., numeric_range for text questions)
  • ✅ Can tune weights per question type independently

Cons:

  • ⚠️ More complex to implement (needs question type detection)
  • ⚠️ More hyperparameters to tune
  • ⚠️ Risk of over-engineering

When to use: If fixed weighting shows poor performance on specific question types.


🎯 Method 3: Max Pooling (Optimistic)

Formula:

def max_pooling_reward(scores):
    """Take the best metric (optimistic reward)."""
    return max(scores.values())

Example:

scores = {
    'cardinality': 0.3,  # Wrong row count
    'value_overlap': 0.8,  # Found most values!
    'numeric_range': 0.4,  # Numbers off
    'row_match': 0.5,  # Some rows match
}

reward = max(0.3, 0.8, 0.4, 0.5) = 0.8

Conceptual Meaning: "Give credit for whatever the agent did best."

Pros:

  • ✅ Very forgiving (agent gets credit for any progress)
  • ✅ Encourages diverse exploration strategies
  • ✅ Simple to implement

Cons:

  • 🛑 Too lenient: Agent might game the easiest metric
  • 🛑 Non-compositional: Doesn't reward improving multiple aspects simultaneously
  • 🛑 Unstable gradients: Reward can jump dramatically between steps

When to use: If agent is struggling to learn anything (extremely sparse rewards). Use as temporary scaffolding, then switch to weighted average.


🎯 Method 4: Minimum Threshold + Average (Strict)

Formula:

def threshold_average_reward(scores, thresholds):
    """All metrics must meet threshold; then take average."""
    # Check all thresholds
    for metric, score in scores.items():
        if score < thresholds.get(metric, 0.0):
            return 0.0  # Fail if any metric below threshold
    
    # All thresholds met → return average
    return sum(scores.values()) / len(scores)

Example:

scores = {
    'cardinality': 0.9,
    'value_overlap': 0.7,
    'numeric_range': 0.3,  # Below threshold!
}

thresholds = {
    'cardinality': 0.5,
    'value_overlap': 0.5,
    'numeric_range': 0.5,
}

# numeric_range (0.3) < threshold (0.5) → return 0.0

Conceptual Meaning: "You must do reasonably well on all aspects to get any reward."

Pros:

  • ✅ Prevents over-optimization of single metric
  • ✅ Encourages balanced progress

Cons:

  • 🛑 Too strict: Might be too hard for early training
  • 🛑 Cliff dynamics: Slight improvement might not change reward at all
  • 🛑 Threshold tuning: Requires careful calibration

When to use: If agent is gaming one metric while ignoring others. Use as penalty mechanism.


🎯 Method 5: Hierarchical (Coarse-to-Fine)

Formula:

def hierarchical_reward(scores):
    """First get cardinality right, then content, then structure."""
    
    # Layer 1: Cardinality (must be >0.5 to proceed)
    if scores['cardinality'] < 0.5:
        return scores['cardinality'] * 0.3  # Low reward, stuck at layer 1
    
    # Layer 2: Content (must be >0.5 to proceed)
    content_score = (scores['value_overlap'] + scores['numeric_range']) / 2
    if content_score < 0.5:
        return 0.3 + content_score * 0.4  # Medium reward, stuck at layer 2
    
    # Layer 3: Structure (all metrics combined)
    final_score = (
        0.2 * scores['cardinality'] +
        0.4 * content_score +
        0.4 * scores['row_match']
    )
    return 0.5 + final_score * 0.5  # High reward, layer 3

Conceptual Meaning: "Learn to get the count right first, then the values, then the structure."

Visualization:

Reward progression:
0.0 ─────── 0.3 ─────── 0.7 ─────── 1.0
     ↑            ↑            ↑
 Cardinality  Content    Structure
   correct     correct     correct

Pros:

  • ✅ Natural curriculum (easier tasks first)
  • ✅ Clear progression signal
  • ✅ Prevents agent from over-optimizing structure before content

Cons:

  • ⚠️ More complex logic
  • ⚠️ Requires careful threshold tuning
  • ⚠️ Might slow down learning if thresholds too strict

When to use: If agent learns poorly with flat reward. Provides curriculum learning.


🎯 Method 6: Product (Multiplicative)

Formula:

def product_reward(scores):
    """Multiply all metrics (all must be good)."""
    product = 1.0
    for score in scores.values():
        product *= score
    return product

Example:

scores = {
    'cardinality': 0.9,
    'value_overlap': 0.8,
    'numeric_range': 0.7,
}

reward = 0.9 × 0.8 × 0.7 = 0.504

Conceptual Meaning: "All aspects must be good; weak performance on any metric drags down total."

Pros:

  • ✅ Encourages balanced improvement
  • ✅ Penalizes weak performance on any dimension

Cons:

  • 🛑 Too strict: Single low score (0.1) makes entire reward near zero
  • 🛑 Vanishing gradients: Product of small numbers becomes very small
  • 🛑 Compressed scale: Even solid progress on every metric multiplies out to a small reward (e.g., 0.7 × 0.7 × 0.7 ≈ 0.34)

When to use: Rarely. Only if you need extremely strict "all-or-nothing" reward.


🎯 Method 7: Percentile Aggregation (Robust)

Formula:

def percentile_reward(scores, percentile=50):
    """Use median (or other percentile) of all metrics."""
    import numpy as np
    return np.percentile(list(scores.values()), percentile)

Example:

scores = [0.9, 0.8, 0.3, 0.7, 0.6]
sorted = [0.3, 0.6, 0.7, 0.8, 0.9]

percentile_50 (median) = 0.7
percentile_75 = 0.8
percentile_25 = 0.6

Conceptual Meaning: "Reward based on typical performance, ignoring outliers."

Pros:

  • ✅ Robust to outlier metrics (one very low or very high score doesn't dominate)
  • ✅ Simple to implement
  • ✅ Tunable (change percentile to be more/less strict)

Cons:

  • ⚠️ Less interpretable than weighted average
  • ⚠️ Ignores some information (throws away best and worst scores)

When to use: If one metric is noisy or unreliable, use median to ignore it.


📊 Comparison Table: Combination Methods

| Method | Complexity | Interpretability | Robustness | Training Stability | Best Use Case |
|---|---|---|---|---|---|
| Weighted Average | Low | High | Medium | High | MVP default |
| Adaptive Weighting | Medium | Medium | High | High | Different question types need different signals |
| Max Pooling | Low | Medium | Low | Low | Agent struggling to learn anything |
| Threshold + Average | Medium | High | Medium | Medium | Agent gaming one metric |
| Hierarchical | High | High | High | Medium | Want curriculum learning |
| Product | Low | Low | Low | Low | All aspects must be perfect (rare) |
| Percentile | Low | Low | High | High | One metric is noisy/unreliable |

🎬 Recommended Implementation Strategy

Phase 1: Start Simple (Week 1)

def compute_progress_reward(agent_result, gold_result):
    """Initial implementation: weighted average of 3 metrics."""
    
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
    }
    
    weights = {'cardinality': 0.25, 'value_overlap': 0.50, 'numeric_range': 0.25}
    
    return weighted_average_reward(scores, weights)

Why: Simple, interpretable, easy to debug.


Phase 2: Add Context (Week 2, if needed)

def compute_progress_reward(agent_result, gold_result, question_type):
    """Adaptive weighting based on question type."""
    
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
        'row_match': rowwise_best_match(agent_result, gold_result),
    }
    
    # Adapt weights
    if question_type == 'numeric':
        weights = {'cardinality': 0.2, 'value_overlap': 0.3, 
                   'numeric_range': 0.4, 'row_match': 0.1}
    elif question_type == 'multirow':
        weights = {'cardinality': 0.25, 'value_overlap': 0.4, 
                   'numeric_range': 0.05, 'row_match': 0.3}
    else:  # default
        weights = {'cardinality': 0.25, 'value_overlap': 0.5, 
                   'numeric_range': 0.15, 'row_match': 0.1}
    
    return weighted_average_reward(scores, weights)

Why: Improves signal quality without major complexity.


Phase 3: Add Safeguards (Week 3, if agent is gaming)

def compute_progress_reward(agent_result, gold_result, question_type):
    """Weighted average with anti-gaming measures."""
    
    scores = compute_all_scores(agent_result, gold_result)
    
    # Anti-gaming: if cardinality is way off, cap other rewards
    if scores['cardinality'] < 0.3:
        # Agent is nowhere close on size → limit credit for content
        return scores['cardinality'] * 0.5
    
    # Anti-gaming: if value overlap is low, cap row match
    if scores['value_overlap'] < 0.4:
        scores['row_match'] *= 0.5  # Penalize structure if content is wrong
    
    # Standard weighted average
    return weighted_average_reward(scores, get_weights(question_type))

Why: Prevents reward hacking while keeping interpretability.


🧪 How to Validate Your Combination Method

Create a test suite (pseudocode: gold, random_result, and the result_v* helpers are placeholder fixtures, and compute_reward stands in for compute_progress_reward):

# Test 1: Perfect match
assert compute_reward(gold, gold) == 1.0

# Test 2: Completely wrong
assert compute_reward(random_result, gold) < 0.2

# Test 3: Monotonicity (better result → higher reward)
result_v1 = partially_correct_result()  # 30% right
result_v2 = more_correct_result()       # 60% right
result_v3 = mostly_correct_result()     # 90% right
assert compute_reward(result_v1, gold) < compute_reward(result_v2, gold) < compute_reward(result_v3, gold)

# Test 4: Bounded [0, 1]
for _ in range(100):
    random_result = generate_random_result()
    reward = compute_reward(random_result, gold)
    assert 0.0 <= reward <= 1.0

# Test 5: Insensitive to format (same values, different structure)
result_format_a = [("Engineering", 42)]
result_format_b = [(42, "Engineering")]
assert abs(compute_reward(result_format_a, gold) - 
           compute_reward(result_format_b, gold)) < 0.1  # Allow small difference

💡 Final Recommendation

For MVP (Phase 1-3):

  • Use Method 1: Weighted Average with metrics #1-3 (cardinality, value overlap, numeric range)
  • Fixed weights: {0.25, 0.50, 0.25}

If training shows issues:

  • Add Method 2: Adaptive Weighting based on question type
  • Add metric #4 (row-wise match)

If agent games rewards:

  • Add threshold checks from Method 4
  • Add anti-gaming logic from Phase 3 example

Never use:

  • Method 3 (Max Pooling) - too gameable
  • Method 6 (Product) - too strict, vanishing gradients
  • Metrics #8-9 (SQL structure, execution plans) - research shows they hurt

Bottom line: Start simple (weighted average of 3 metrics), add complexity only when needed, always validate with test suite.